---
title: Your First Hurdle BCF Fit
review-state: drafting
last-human-review: "2026-06-04"
depends-on:
  - src/pytyche/bcf/hurdle/model.py
  - src/pytyche/generators/api.py
owner: tradcliffe
quadrant: tutorial
---

# Your First Hurdle BCF Fit

*Context: this tutorial assumes the [overview](../concepts/overview.md),
which explains what pytyche is for and the kind of experiment it is
designed around.*

This tutorial walks you from install through fitting a joint hurdle BCF
on a small synthetic adaptive-enrichment dataset and reading the posterior
at segment-GATE granularity. The deliverable: by the end you will have
the posterior mean and 95% credible interval for the segment-level group
average treatment effect (GATE) on revenue-per-visitor, and you'll see
the model recover a planted "responders vs non-responders" split.

The complete tutorial runs end-to-end on JAX-CPU in well under two
minutes. Every code block in this page is executed by the doc test
suite, so what you see here is what the library does today.

## The terms, in one paragraph

If you're new to the vocabulary, here is everything you need before the
code:

- **Zero-inflation.** Revenue-per-visitor is mostly zeros: the large
  majority of visitors don't convert and spend nothing, and a minority
  convert and spend a positive amount. A single model that tries to fit
  "average revenue" directly fights that wall of zeros and ends up
  estimating neither piece well.
- **The hurdle split.** A *hurdle* model factors the outcome into two
  questions that are cleaner on their own: **did the visitor convert?**
  (a yes/no *conversion* channel) and, **given they converted, how much
  did they spend?** (a positive-valued *severity* channel).
  Revenue-per-visitor is the product: conversion probability × expected
  spend given conversion. Modeling the two channels separately and
  composing them is what lets the model learn from the rare converters
  without being swamped by the zeros.
- **BCF** — *Bayesian Causal Forest*. A tree-ensemble model that
  estimates how a treatment *changes* an outcome, with full posterior
  uncertainty (credible intervals), and that separates the baseline
  outcome surface from the treatment-effect surface. **Joint hurdle
  BCF** runs that machinery on both hurdle channels at once with shared
  tree structure.
- **CATE / GATE.** The **CATE** (conditional average treatment effect)
  is the treatment effect *for a given visitor* — how much treatment
  shifts their expected revenue. The **GATE** (group average treatment
  effect) is the CATE averaged over a *group* of visitors — here, a
  segment. Pytyche is built to deliver honest *GATEs* (segment-level
  effects you can act on), not precise per-visitor CATEs, which are noisy
  at realistic sample sizes.
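
The hurdle split can be checked numerically. The sketch below is a standalone illustration, separate from the tutorial's tested blocks, and uses only NumPy. It simulates a zero-inflated revenue column and confirms that the direct average equals conversion rate × mean spend among converters. The parameter values and the lognormal spend distribution are assumptions chosen for illustration, not something the library prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical parameters, for illustration only.
p_convert = 0.08               # conversion probability
log_mu, log_sigma = 3.5, 0.5   # log-scale spend parameters (lognormal is an assumption)

# Conversion channel: did the visitor convert?
converted = rng.random(n) < p_convert
# Severity channel: spend, which only counts for converters.
spend = rng.lognormal(mean=log_mu, sigma=log_sigma, size=n)
revenue = np.where(converted, spend, 0.0)  # zero for non-converters

# Hurdle identity: E[revenue] = P(convert) * E[spend | convert].
conv_rate = converted.mean()
mean_spend_given_conv = revenue[converted].mean()
rpv_direct = revenue.mean()
rpv_hurdle = conv_rate * mean_spend_given_conv

print(f"share of zeros:         {(revenue == 0).mean():.3f}")
print(f"RPV, direct average:    {rpv_direct:.4f}")
print(f"RPV, hurdle product:    {rpv_hurdle:.4f}")
```

On a sample the two numbers agree up to floating-point error, because the direct average is the same sum divided differently. The benefit of the split comes at modeling time, when each channel gets its own likelihood.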

With those in hand: this tutorial fits a joint hurdle BCF and reads the
per-segment GATE on revenue-per-visitor.

## What you will do

1. Install pytyche.
2. Generate an 800-visitor synthetic dataset with two segments: one
   where treatment lifts revenue, one where it does nothing.
3. Fit `fit_hurdle_bcf` at small sizing.
4. Read the posterior: overall lift, then per-segment GATE with a 95%
   credible interval, and compare each to the planted ground truth.

## Install

Until the PyPI release lands, install from source (see
[Installation](../getting-started/installation.md) for details):

```bash
git clone --recurse-submodules https://gitlab.com/tradcliffe2/tyche
cd tyche
uv sync --all-extras
```

Once published, the one-liner will be `uv add 'pytyche[gpu]'` (CUDA 12), or
`uv add pytyche` for the CPU-only build — fine for this tutorial, which runs
on CPU.

```{testsetup}
import os
os.environ["JAX_PLATFORMS"] = "cpu"
```

## Generate a small adaptive-enrichment dataset

Pytyche ships a built-in generator at `pytyche.generate` (re-exported from `pytyche.generators.api`) that produces a
two-variant experiment (`control` vs `treatment`) with planted per-segment
effects and analytical ground truth. We will plant a clear "responders"
segment (40% of the population, treatment lifts conversion by +10pp and
shifts log-AOV by +0.15) and a "non-responders" segment (60% of the
population, treatment effect is zero on both channels).

```{testcode}
from pytyche import generate

SEGMENTS = {
    "responders": {
        "pct": 0.4,
        "base_conv": 0.08,
        "treatment_effect": 0.10,
        "aov_mu": 3.5,
        "aov_sigma": 0.5,
        "treatment_aov_mu_shift": 0.15,
    },
    "non_responders": {
        "pct": 0.6,
        "base_conv": 0.06,
        "treatment_effect": 0.0,
        "aov_mu": 3.3,
        "aov_sigma": 0.5,
        "treatment_aov_mu_shift": 0.0,
    },
}

bundle = generate(
    n_visitors=800,
    segments=SEGMENTS,
    metric="revenue_per_visitor",
    seed=0,
)
```

`bundle` is a `CalibrationBundle` — a typed pair of observed data and
ground truth. The runner of a calibration sweep would unpack the bundle
and pass only `bundle.observed` to the analyzer; for this tutorial we
have both halves so we can compare the fit against the planted truth.
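
For a rough sense of the planted lift's magnitude, the segment parameters can be turned into a closed-form revenue-per-visitor difference. This is a back-of-envelope sketch, not the generator's own calculation: it assumes that order value is lognormal with `aov_mu` and `aov_sigma` on the log scale, and that `treatment_effect` adds directly to the conversion probability. The authoritative numbers are whatever `bundle.truth` reports. The `rpv` helper is hypothetical.

```python
import math

def rpv(conv, aov_mu, aov_sigma):
    """Hypothetical helper: conversion probability times lognormal mean spend."""
    # Mean of a lognormal with log-scale parameters (mu, sigma) is exp(mu + sigma^2 / 2).
    return conv * math.exp(aov_mu + 0.5 * aov_sigma ** 2)

# Same planted values as the tutorial's SEGMENTS dict.
segments = {
    "responders": dict(pct=0.4, base_conv=0.08, treatment_effect=0.10,
                       aov_mu=3.5, aov_sigma=0.5, treatment_aov_mu_shift=0.15),
    "non_responders": dict(pct=0.6, base_conv=0.06, treatment_effect=0.0,
                           aov_mu=3.3, aov_sigma=0.5, treatment_aov_mu_shift=0.0),
}

lift = {}
for name, s in segments.items():
    control = rpv(s["base_conv"], s["aov_mu"], s["aov_sigma"])
    treated = rpv(s["base_conv"] + s["treatment_effect"],
                  s["aov_mu"] + s["treatment_aov_mu_shift"],
                  s["aov_sigma"])
    lift[name] = treated - control
    print(f"{name:<16} control={control:.3f} treated={treated:.3f} lift={lift[name]:.3f}")

# Population-level lift is the share-weighted average of the segment lifts.
overall_lift = sum(s["pct"] * lift[name] for name, s in segments.items())
print(f"overall lift (share-weighted): {overall_lift:.3f}")
```

If these assumptions match the generator, the printed lifts should be close to the truth column shown later on this page; if they do not match, trust `bundle.truth`.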

The data layout the model wants is four NumPy arrays: a covariate matrix
`X`, a treatment indicator `Z`, the outcome `Y_rev` (zero for
non-converters, revenue for converters), and propensity scores (each
visitor's probability of assignment to treatment, a constant 0.5 in the
code below). For this tutorial the only feature we use is the segment
indicator, encoded as a single integer-coded column stored as float32.

```{testcode}
import numpy as np
import pandas as pd

control_df = bundle.observed.variants[0].visitors
treatment_df = bundle.observed.variants[1].visitors
visitors = pd.concat([control_df, treatment_df], ignore_index=True)

seg_to_idx = {name: i for i, name in enumerate(SEGMENTS)}
X = visitors["segment"].map(seg_to_idx).to_numpy().reshape(-1, 1).astype(np.float32)
Z = (visitors["variant"] == "treatment").to_numpy().astype(np.float32)
Y_rev = visitors["revenue"].to_numpy().astype(np.float32)
propensity = np.full(len(visitors), 0.5, dtype=np.float32)
```

The dataset is small by design: roughly 800 visitors, of which only
about 60-80 are converters. Hurdle BCF was built for exactly this
shape — most rows zero, a handful of rows carrying the lift signal.
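
Before fitting, it can help to confirm that the four arrays have the layout described above. The following is a minimal sketch of such a check. `check_hurdle_inputs` is a hypothetical helper, not part of pytyche, and it runs here on stand-in arrays so that the block is self-contained; in the tutorial you would pass `X`, `Z`, `Y_rev`, and `propensity` instead.

```python
import numpy as np

def check_hurdle_inputs(X, Z, Y_rev, propensity):
    """Hypothetical helper (not a pytyche API): validate the four-array layout."""
    n = X.shape[0]
    assert X.ndim == 2, "X must be a 2-D covariate matrix of shape (n, p)"
    assert Z.shape == (n,) and np.isin(Z, (0, 1)).all(), "Z must be a 0/1 vector"
    assert Y_rev.shape == (n,) and np.all(Y_rev >= 0), "revenue must be non-negative"
    assert propensity.shape == (n,), "one propensity score per visitor"
    assert np.all((propensity > 0) & (propensity < 1)), "propensity must lie in (0, 1)"
    return {
        "n": n,
        "treated_share": float(Z.mean()),
        "zero_share": float((Y_rev == 0).mean()),
        "n_converters": int((Y_rev > 0).sum()),
    }

# Stand-in arrays with the same layout as the tutorial's inputs.
rng = np.random.default_rng(0)
n = 800
X_demo = rng.integers(0, 2, size=(n, 1)).astype(np.float32)   # one segment column
Z_demo = (np.arange(n) % 2).astype(np.float32)                # alternating assignment
converted = rng.random(n) < 0.08
Y_demo = np.where(converted, rng.lognormal(3.4, 0.5, n), 0.0).astype(np.float32)
prop_demo = np.full(n, 0.5, dtype=np.float32)

summary = check_hurdle_inputs(X_demo, Z_demo, Y_demo, prop_demo)
print(summary)
```

The `zero_share` and `n_converters` entries are the quickest way to see whether a dataset has the zero-heavy shape the hurdle model expects.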

## Fit the joint hurdle BCF

`fit_hurdle_bcf` is the canonical entry point. It runs a joint
shared-tree hurdle model: two forests (a prognostic `mu` forest and a
treatment-effect `tau` forest), each with shared tree structure but
separate leaf values for the conversion channel and the severity
channel. Grow / prune proposals are accepted using the joint hurdle
log-marginal likelihood.

For the tutorial we deliberately set the sizing knobs small. These are
not production values: they are tuned to finish on JAX-CPU in well
under two minutes so the doc test suite can run the tutorial on every
PR. The GPU recommendation for larger problems (see "Scaling
considerations" below) applies to production-sized fits, not this
example.

```{testcode}
from pytyche import GPUBCFConfig, fit_hurdle_bcf

config = GPUBCFConfig(
    num_burnin=40,
    num_mcmc=80,
    num_trees_mu=30,
    num_trees_tau=15,
    max_depth=4,
    num_gfr_sweeps=2,
    diagnostic_interval=20,
    random_seed=0,
)

result = fit_hurdle_bcf(X, Z, Y_rev, propensity, config)
```

`result` is a `HurdleBCFResult`. The field that matters for posterior
interpretation is `rpv_cate_samples`: an `(n, S)` array of
posterior draws of the per-visitor revenue-per-visitor CATE, where
`S = num_mcmc / thin_factor * num_chains`. Here `S = 80`: every one of
the 80 MCMC draws is retained, which the shape assertion in the
per-segment block below checks.

## Read the posterior

The first thing to look at is the overall posterior: does the model
think there is, on average, a lift? Averaging the CATE over visitors
within each draw gives one population-level lift per draw, and
summarizing across draws gives its posterior mean and credible interval.

```{testcode}
overall_posterior_per_draw = result.rpv_cate_samples.mean(axis=0)  # (S,)
overall_mean = overall_posterior_per_draw.mean()
overall_lo = np.quantile(overall_posterior_per_draw, 0.025)
overall_hi = np.quantile(overall_posterior_per_draw, 0.975)

print(f"Overall posterior RPV lift: mean={overall_mean:.3f} "
      f"(95% CI: {overall_lo:.3f} - {overall_hi:.3f}) | "
      f"true={bundle.truth.effect:.3f}")
```

The 95% credible interval here is over draws, not over visitors. It is
the posterior uncertainty about *the* population-level lift, not about
any individual visitor.
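
The distinction between "over draws" and "over visitors" is easy to blur, so here is a small standalone sketch of both summaries on a synthetic `(n, S)` matrix. The matrix is made up for illustration (it is not output from `fit_hurdle_bcf`), and the noise scales are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, S = 500, 80

# Synthetic stand-in for an (n, S) matrix of CATE draws:
# visitors differ a lot from each other, draws differ only a little.
visitor_effect = rng.normal(2.0, 1.5, size=(n, 1))   # heterogeneity across visitors
draw_shift = rng.normal(0.0, 0.3, size=(1, S))       # posterior uncertainty shared per draw
noise = rng.normal(0.0, 0.5, size=(n, S))
cate_draws = visitor_effect + draw_shift + noise

# Credible interval for the population-level lift: average over visitors
# within each draw (axis=0), then take quantiles across draws.
per_draw_avg = cate_draws.mean(axis=0)                # shape (S,)
ci_over_draws = np.quantile(per_draw_avg, [0.025, 0.975])

# A different quantity: the spread of per-visitor posterior means.
# This describes heterogeneity between visitors, not uncertainty about the average.
per_visitor_mean = cate_draws.mean(axis=1)            # shape (n,)
spread_over_visitors = np.quantile(per_visitor_mean, [0.025, 0.975])

print("95% interval over draws:   ", np.round(ci_over_draws, 3))
print("95% spread over visitors:  ", np.round(spread_over_visitors, 3))
```

The first interval answers "how sure are we about the average lift?"; the second answers "how much do visitors differ?". Only the first is the credible interval reported above.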

The point of an adaptive-enrichment design, though, is not the average
but the *segment-level* GATE. Pytyche is built around segment-level
inference (see [overview](../concepts/overview.md)
§"Segment-level GATE focus"); the posterior interpretation step that
matters is per-segment.

```{testcode}
segment_array = visitors["segment"].to_numpy()
cate_per_draw = result.rpv_cate_samples  # (n, S)

print(f"\n{'segment':<16} {'post mean':>10} {'2.5%':>10} {'97.5%':>10} {'truth':>10}")
print("-" * 60)
for seg_name in SEGMENTS:
    mask = segment_array == seg_name
    # Per-draw segment GATE: average CATE over visitors in the segment,
    # then summarize across draws.
    seg_gates = cate_per_draw[mask].mean(axis=0)  # (S,)
    true_seg_cate = float(bundle.truth.cate_per_visitor.values[mask][0])
    print(f"{seg_name:<16} {seg_gates.mean():>10.3f} "
          f"{np.quantile(seg_gates, 0.025):>10.3f} "
          f"{np.quantile(seg_gates, 0.975):>10.3f} "
          f"{true_seg_cate:>10.3f}")

assert np.all(np.isfinite(cate_per_draw)), "posterior must be finite"
assert cate_per_draw.shape == (len(visitors), config.num_mcmc)
```

At this small sample size and small MCMC budget, the posterior is wide,
but the structural answer is right: the `responders` segment posterior
mass sits well above zero, and the `non_responders` 95% credible
interval contains zero. That separation is the deliverable — the model
discriminates between segments where treatment helps and segments where
it doesn't.
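
One way to put a number on that separation is the posterior probability that a segment's GATE is positive, and the probability that one segment's GATE exceeds the other's. The sketch below computes both from per-draw segment GATEs. It runs on synthetic draws with made-up centers and scales so that it is self-contained; with a real fit you would start from `result.rpv_cate_samples` and the segment labels instead.

```python
import numpy as np

rng = np.random.default_rng(2)
S = 80
n_resp, n_non = 320, 480

# Synthetic (n, S) draws: responders centered well above zero,
# non-responders centered at zero. Values are illustrative assumptions.
draw_level_resp = rng.normal(4.0, 1.5, size=S)
draw_level_non = rng.normal(0.0, 1.0, size=S)
cate_draws = np.vstack([
    draw_level_resp + rng.normal(0.0, 0.5, size=(n_resp, S)),
    draw_level_non + rng.normal(0.0, 0.5, size=(n_non, S)),
])
segment = np.array(["responders"] * n_resp + ["non_responders"] * n_non)

# Per-draw GATE for each segment: average CATE over that segment's visitors.
gates = {
    name: cate_draws[segment == name].mean(axis=0)   # shape (S,)
    for name in ("responders", "non_responders")
}

# Posterior probability that each segment's GATE is positive.
p_positive = {name: float((g > 0).mean()) for name, g in gates.items()}

# Posterior probability that responders benefit more than non-responders,
# computed draw by draw so that shared posterior uncertainty is respected.
contrast = gates["responders"] - gates["non_responders"]
p_resp_larger = float((contrast > 0).mean())

print(p_positive)
print(f"P(GATE_responders > GATE_non_responders) = {p_resp_larger:.2f}")
```

With only 80 draws these probabilities are coarse (steps of 1/80), so treat them as a summary of the fit at hand rather than a precise tail estimate.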

What you should *not* read off the output is a precise point estimate
for each visitor's individual CATE. Per-row CATEs from the joint
hurdle BCF are noisy at this sample size; the library is calibrated
to deliver honest segment-level GATEs, not individual-level point
predictions. The
[overview](../concepts/overview.md) doc explains why this is
the right level of inferential ambition.

## Scaling considerations

The sizing knobs above (`num_burnin=40`, `num_mcmc=80`,
`num_trees_mu=30`, `num_trees_tau=15`) are tutorial-grade. Production
adaptive-enrichment fits use larger sample sizes per round (thousands
to tens of thousands of visitors), more trees (typical defaults are
`num_trees_mu=200`, `num_trees_tau=50`), longer MCMC chains
(`num_burnin=200`, `num_mcmc=200` or more), and multiple chains for
between-chain convergence diagnostics. Those settings are GPU
territory — the JAX kernels in
`pytyche.bcf` are written for CUDA and become impractical on CPU as
the problem grows. For benchmark-grade timing claims, see the
artifacts in `bench/` rather than trusting prose.
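
To get a feel for the gap between the two sizings, you can count tree updates per fit. The sketch below is a deliberately crude proxy and makes no timing claim: it assumes one update per tree per iteration, ignores the GFR warm-start sweeps, and ignores that each update gets more expensive as the number of visitors and the tree depth grow. `tree_updates` is a hypothetical helper, not a pytyche function.

```python
def tree_updates(num_burnin, num_mcmc, num_trees_mu, num_trees_tau, num_chains=1):
    """Hypothetical proxy: iterations x trees x chains (not a runtime estimate)."""
    iterations = num_burnin + num_mcmc
    trees = num_trees_mu + num_trees_tau
    return iterations * trees * num_chains

# Tutorial sizing from this page.
tutorial = tree_updates(num_burnin=40, num_mcmc=80, num_trees_mu=30, num_trees_tau=15)

# Production-style sizing quoted above, single chain.
production = tree_updates(num_burnin=200, num_mcmc=200, num_trees_mu=200, num_trees_tau=50)

print(f"tutorial:   {tutorial:,} tree updates")
print(f"production: {production:,} tree updates per chain")
print(f"ratio:      {production / tutorial:.1f}x, before accounting for larger n or more chains")
```

The ratio understates the real gap, since production rounds also carry many more visitors and usually several chains.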

## Where to go next

- **[Your first adaptive experiment](first-adaptive-experiment.md)**
  — run the full multi-round design with Thompson allocation and
  controls retention; the product's main loop, end to end.
- **[Working with the posterior](working-with-the-posterior.md)** —
  the reference companion for everything a fitted posterior can do:
  segmentation, allocation, decision support.
- **[BCF posterior calibration at scale](../concepts/bcf-calibration-at-scale.md)**
  — why raw credible intervals are too narrow at large n, and how
  recalibration corrects them. The hands-on calibration walkthrough is
  not yet written; this concept page is the current reference.
- **[Intended use](../concepts/overview.md)** — if you haven't read
  the overview yet, this is the right time.