---
title: Your First Hurdle BCF Fit
review-state: drafting
last-human-review: "2026-06-04"
depends-on:
  - src/pytyche/bcf/hurdle/model.py
  - src/pytyche/generators/api.py
owner: tradcliffe
quadrant: tutorial
---

# Your First Hurdle BCF Fit

*Context: this tutorial assumes the [overview](../concepts/overview.md), which explains what pytyche is for and the kind of experiment it is designed around.*

This tutorial walks you from install through fitting a joint hurdle BCF on a small synthetic adaptive-enrichment dataset and reading the posterior at segment-GATE granularity. The deliverable: by the end you will have the posterior mean and 95% credible interval for the segment-level group average treatment effect (GATE) on revenue-per-visitor, and you'll see the model recover a planted "responders vs non-responders" split.

The complete tutorial runs end-to-end on JAX-CPU in well under two minutes. Every code block in this page is executed by the doc test suite, so what you see here is what the library does today.

## The terms, in one paragraph

If you're new to the vocabulary, here is everything you need before the code:

- **Zero-inflation.** Revenue-per-visitor is mostly zeros: the large majority of visitors don't convert and spend nothing, and a minority convert and spend a positive amount. A single model that tries to fit "average revenue" directly fights that wall of zeros and ends up estimating neither piece well.
- **The hurdle split.** A *hurdle* model factors the outcome into two questions that are cleaner on their own: **did the visitor convert?** (a yes/no *conversion* channel) and, **given they converted, how much did they spend?** (a positive-valued *severity* channel). Revenue-per-visitor is the product: conversion probability × expected spend given conversion. Modeling the two channels separately and composing them is what lets the model learn from the rare converters without being swamped by the zeros.
- **BCF**: *Bayesian Causal Forest*.
  A tree-ensemble model that estimates how a treatment *changes* an outcome, with full posterior uncertainty (credible intervals), and that separates the baseline outcome surface from the treatment-effect surface. **Joint hurdle BCF** runs that machinery on both hurdle channels at once with shared tree structure.
- **CATE / GATE.** The **CATE** (conditional average treatment effect) is the treatment effect *for a given visitor*: how much treatment shifts their expected revenue. The **GATE** (group average treatment effect) is the CATE averaged over a *group* of visitors (here, a segment). Pytyche is built to deliver honest *GATEs* (segment-level effects you can act on), not precise per-visitor CATEs, which are noisy at realistic sample sizes.

With those in hand: this tutorial fits a joint hurdle BCF and reads the per-segment GATE on revenue-per-visitor.

## What you will do

1. Install pytyche.
2. Generate an 800-visitor synthetic dataset with two segments: one where treatment lifts revenue, one where it does nothing.
3. Fit `fit_hurdle_bcf` at small sizing.
4. Read the posterior: overall lift, then per-segment GATE with a 95% credible interval, and compare each to the planted ground truth.

## Install

Until the PyPI release lands, install from source (see [Installation](../getting-started/installation.md) for details):

```bash
git clone --recurse-submodules https://gitlab.com/tradcliffe2/tyche
cd tyche
uv sync --all-extras
```

Once published, the one-liner will be `uv add 'pytyche[gpu]'` (CUDA 12), or `uv add pytyche` for the CPU-only build, which is fine for this tutorial since it runs on CPU.

```{testsetup}
import os
os.environ["JAX_PLATFORMS"] = "cpu"
```

## Generate a small adaptive-enrichment dataset

Pytyche ships a built-in generator at `pytyche.generate` (re-exported from `pytyche.generators.api`) that produces a two-variant experiment (`control` vs `treatment`) with planted per-segment effects and analytical ground truth.
We will plant a clear "responders" segment (40% of the population, treatment lifts conversion by +10pp and shifts log-AOV by +0.15) and a "non-responders" segment (60% of the population, treatment effect is zero on both channels).

```{testcode}
from pytyche import generate

SEGMENTS = {
    "responders": {
        "pct": 0.4,
        "base_conv": 0.08,
        "treatment_effect": 0.10,
        "aov_mu": 3.5,
        "aov_sigma": 0.5,
        "treatment_aov_mu_shift": 0.15,
    },
    "non_responders": {
        "pct": 0.6,
        "base_conv": 0.06,
        "treatment_effect": 0.0,
        "aov_mu": 3.3,
        "aov_sigma": 0.5,
        "treatment_aov_mu_shift": 0.0,
    },
}

bundle = generate(
    n_visitors=800,
    segments=SEGMENTS,
    metric="revenue_per_visitor",
    seed=0,
)
```

`bundle` is a `CalibrationBundle`: a typed pair of observed data and ground truth. The runner of a calibration sweep would unpack the bundle and pass only `bundle.observed` to the analyzer; for this tutorial we have both halves so we can compare the fit against the planted truth.

The data layout the model wants is four NumPy arrays: a covariate matrix `X`, a treatment indicator `Z`, the outcome `Y_rev` (zero for non-converters, revenue for converters), and propensity scores. For this tutorial the only feature we use is the segment indicator, encoded as a single integer column.

```{testcode}
import numpy as np
import pandas as pd

control_df = bundle.observed.variants[0].visitors
treatment_df = bundle.observed.variants[1].visitors
visitors = pd.concat([control_df, treatment_df], ignore_index=True)

seg_to_idx = {name: i for i, name in enumerate(SEGMENTS)}
X = visitors["segment"].map(seg_to_idx).to_numpy().reshape(-1, 1).astype(np.float32)
Z = (visitors["variant"] == "treatment").to_numpy().astype(np.float32)
Y_rev = visitors["revenue"].to_numpy().astype(np.float32)
propensity = np.full(len(visitors), 0.5, dtype=np.float32)
```

The dataset is small by design: roughly 800 visitors, of which only about 60-80 are converters.
Hurdle BCF was built for exactly this shape: most rows zero, a handful of rows carrying the lift signal.

## Fit the joint hurdle BCF

`fit_hurdle_bcf` is the canonical entry point. It runs a joint shared-tree hurdle model: two forests (a prognostic `mu` forest and a treatment-effect `tau` forest), each with shared tree structure but separate leaf values for the conversion channel and the severity channel. Grow / prune proposals are accepted using the joint hurdle log-marginal likelihood.

For the tutorial we deliberately set the sizing knobs small. These are not production values; they are tuned to finish on JAX-CPU in well under two minutes so the doc test suite can run the tutorial on every PR. The "GPU recommended for larger problems" callout above applies to production-sized problems, not this example.

```{testcode}
from pytyche import GPUBCFConfig, fit_hurdle_bcf

config = GPUBCFConfig(
    num_burnin=40,
    num_mcmc=80,
    num_trees_mu=30,
    num_trees_tau=15,
    max_depth=4,
    num_gfr_sweeps=2,
    diagnostic_interval=20,
    random_seed=0,
)

result = fit_hurdle_bcf(X, Z, Y_rev, propensity, config)
```

`result` is a `HurdleBCFResult`. The field that matters for posterior interpretation is `rpv_cate_samples`: an `(n, S)` array of posterior draws of the per-visitor revenue-per-visitor CATE, where `S = num_mcmc / thin_factor * num_chains` (here `S = 80`).

## Read the posterior

The first thing to look at is the overall posterior: does the model think there is, on average, a lift? Marginalizing over visitors and draws gives the population-level effect.
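
To make the axis convention concrete, here is a minimal NumPy sketch on a synthetic `(n, S)` array. Everything in it is an assumption made for illustration: the array, its sizes, and the noise scales are invented and are not pytyche output. It contrasts an interval taken over draws (uncertainty about the population-level effect) with the spread across visitors (heterogeneity between visitors).

```python
import numpy as np

# Synthetic stand-in for an (n, S) array of posterior CATE draws.
# All values are made up for illustration; this is not pytyche output.
rng = np.random.default_rng(0)
n, S = 500, 80
visitor_effect = rng.normal(loc=1.0, scale=2.0, size=(n, 1))  # differs across visitors
draw_shift = rng.normal(loc=0.0, scale=0.3, size=(1, S))      # differs across draws
samples = visitor_effect + draw_shift                          # broadcasts to (n, S)

# Average over visitors (axis 0) within each draw:
# one population-level effect per posterior draw.
per_draw_mean = samples.mean(axis=0)                           # shape (S,)

# Interval over draws: posterior uncertainty about the population-level effect.
ci_over_draws = np.quantile(per_draw_mean, [0.025, 0.975])

# Average over draws (axis 1) for each visitor, then look at the spread
# across visitors: this measures heterogeneity, not uncertainty.
per_visitor_mean = samples.mean(axis=1)                        # shape (n,)
spread_over_visitors = np.quantile(per_visitor_mean, [0.025, 0.975])

print("interval over draws: ", ci_over_draws)
print("spread over visitors:", spread_over_visitors)
```

In this toy setup the interval over draws is much narrower than the spread over visitors, because the two quantities answer different questions. The same axis-0 reduction is what gets applied to `result.rpv_cate_samples`.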
```{testcode}
overall_posterior_per_draw = result.rpv_cate_samples.mean(axis=0)  # (S,)
overall_mean = overall_posterior_per_draw.mean()
overall_lo = np.quantile(overall_posterior_per_draw, 0.025)
overall_hi = np.quantile(overall_posterior_per_draw, 0.975)

print(f"Overall posterior RPV lift: mean={overall_mean:.3f} "
      f"(95% CI: {overall_lo:.3f} - {overall_hi:.3f}) | "
      f"true={bundle.truth.effect:.3f}")
```

The 95% credible interval here is over draws, not over visitors. It is the posterior uncertainty about *the* population-level lift, not about any individual visitor.

The point of an adaptive-enrichment design, though, is not the average; it is the *segment-level* GATE. Pytyche is built around segment-level inference (see [overview](../concepts/overview.md) §"Segment-level GATE focus"); the posterior interpretation step that matters is per-segment.

```{testcode}
segment_array = visitors["segment"].to_numpy()
cate_per_draw = result.rpv_cate_samples  # (n, S)

print(f"\n{'segment':<16} {'post mean':>10} {'2.5%':>10} {'97.5%':>10} {'truth':>10}")
print("-" * 60)
for seg_name in SEGMENTS:
    mask = segment_array == seg_name
    # Per-draw segment GATE: average CATE over visitors in the segment,
    # then summarize across draws.
    seg_gates = cate_per_draw[mask].mean(axis=0)  # (S,)
    true_seg_cate = float(bundle.truth.cate_per_visitor.values[mask][0])
    print(f"{seg_name:<16} {seg_gates.mean():>10.3f} "
          f"{np.quantile(seg_gates, 0.025):>10.3f} "
          f"{np.quantile(seg_gates, 0.975):>10.3f} "
          f"{true_seg_cate:>10.3f}")

assert np.all(np.isfinite(cate_per_draw)), "posterior must be finite"
assert cate_per_draw.shape == (len(visitors), config.num_mcmc)
```

At this small sample size and small MCMC budget, the posterior is wide, but the structural answer is right: the `responders` segment posterior mass sits well above zero, and the `non_responders` 95% credible interval contains zero.
That separation is the deliverable: the model discriminates between segments where treatment helps and segments where it doesn't.

What you should *not* read off the output is a precise point estimate for each visitor's individual CATE. Per-row CATEs from the joint hurdle BCF are noisy at this sample size; the library is calibrated to deliver honest segment-level GATEs, not individual-level point predictions. The [overview](../concepts/overview.md) doc explains why this is the right level of inferential ambition.

## Scaling considerations

The sizing knobs above (`num_burnin=40`, `num_mcmc=80`, `num_trees_mu=30`, `num_trees_tau=15`) are tutorial-grade. Production adaptive-enrichment fits use larger sample sizes per round (thousands to tens of thousands of visitors), more trees (typical defaults are `num_trees_mu=200`, `num_trees_tau=50`), longer MCMC chains (`num_burnin=200`, `num_mcmc=200` or more), and multiple chains for between-chain convergence diagnostics.

Those settings are GPU territory: the JAX kernels in `pytyche.bcf` are written for CUDA and become impractical on CPU as the problem grows. For benchmark-grade timing claims, see the artifacts in `bench/` rather than trusting prose.

## Where to go next

- **[Your first adaptive experiment](first-adaptive-experiment.md)**: run the full multi-round design with Thompson allocation and controls retention; the product's main loop, end to end.
- **[Working with the posterior](working-with-the-posterior.md)**: the reference companion for everything a fitted posterior can do: segmentation, allocation, decision support.
- **[BCF posterior calibration at scale](../concepts/bcf-calibration-at-scale.md)**: why raw credible intervals are too narrow at large n, and how recalibration corrects them. The hands-on calibration walkthrough is not yet written; this concept page is the current reference.
- **[Intended use](../concepts/overview.md)**: if you haven't read the overview yet, this is the right time.