Your First Hurdle BCF Fit

Context: this tutorial assumes you have read the overview, which explains what pytyche is for and the kind of experiment it is designed around.

This tutorial walks you from install through fitting a joint hurdle BCF on a small synthetic adaptive-enrichment dataset and reading the posterior at segment-GATE granularity. The deliverable: by the end you will have the posterior mean and 95% credible interval for the segment-level group average treatment effect (GATE) on revenue-per-visitor, and you’ll see the model recover a planted “responders vs non-responders” split.

The complete tutorial runs end-to-end on JAX-CPU in well under two minutes. Every code block in this page is executed by the doc test suite, so what you see here is what the library does today.

The terms, in brief

If you’re new to the vocabulary, here is everything you need before the code:

  • Zero-inflation. Revenue-per-visitor is mostly zeros: the large majority of visitors don’t convert and spend nothing, and a minority convert and spend a positive amount. A single model that tries to fit “average revenue” directly fights that wall of zeros and ends up estimating neither piece well.

  • The hurdle split. A hurdle model factors the outcome into two questions that are cleaner on their own: did the visitor convert? (a yes/no conversion channel) and, given they converted, how much did they spend? (a positive-valued severity channel). Expected revenue-per-visitor is the product of the two: conversion probability × expected spend given conversion. Modeling the two channels separately and composing them is what lets the model learn from the rare converters without being swamped by the zeros.

  • BCF (Bayesian Causal Forest). A tree-ensemble model that estimates how a treatment changes an outcome, with full posterior uncertainty (credible intervals), and that separates the baseline outcome surface from the treatment-effect surface. Joint hurdle BCF runs that machinery on both hurdle channels at once with shared tree structure.

  • CATE / GATE. The CATE (conditional average treatment effect) is the expected treatment effect for a visitor with given covariates, that is, how much treatment shifts their expected revenue. The GATE (group average treatment effect) is the CATE averaged over a group of visitors (here, a segment). Pytyche is built to deliver honest GATEs (segment-level effects you can act on), not precise per-visitor CATEs, which are noisy at realistic sample sizes.

With those in hand: this tutorial fits a joint hurdle BCF and reads the per-segment GATE on revenue-per-visitor.
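The hurdle factorization is easy to check numerically. The sketch below uses only NumPy and made-up parameters (a 7% conversion rate and lognormal spend); none of it comes from pytyche. It shows that the sample mean of a zero-inflated outcome equals the conversion rate times the mean spend among converters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Illustrative parameters only; they are not pytyche defaults.
converted = rng.random(n) < 0.07                     # conversion channel (yes/no)
spend = rng.lognormal(mean=3.4, sigma=0.5, size=n)   # severity channel (positive)
revenue = np.where(converted, spend, 0.0)            # zero for non-converters

conv_rate = converted.mean()
mean_spend_given_conv = revenue[converted].mean()

# Revenue-per-visitor two ways: directly, and as the hurdle product.
rpv_direct = revenue.mean()
rpv_hurdle = conv_rate * mean_spend_given_conv

print(f"share of zeros: {1 - conv_rate:.1%}")
print(f"RPV direct={rpv_direct:.4f}  hurdle product={rpv_hurdle:.4f}")
```

The two numbers agree up to floating-point error because the identity is exact in-sample; the modeling benefit comes from estimating each factor with a model suited to it.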

What you will do

  1. Install pytyche.

  2. Generate an 800-visitor synthetic dataset with two segments: one where treatment lifts revenue, one where it does nothing.

  3. Fit fit_hurdle_bcf at small sizing.

  4. Read the posterior: overall lift, then per-segment GATE with a 95% credible interval, and compare each to the planted ground truth.

Install

Until the PyPI release lands, install from source (see Installation for details):

git clone --recurse-submodules https://gitlab.com/tradcliffe2/tyche
cd tyche
uv sync --all-extras

Once published, the one-liner will be uv add 'pytyche[gpu]' (CUDA 12), or uv add pytyche for the CPU-only build — fine for this tutorial, which runs on CPU.

Generate a small adaptive-enrichment dataset

Pytyche ships a built-in generator at pytyche.generate (re-exported from pytyche.generators.api) that produces a two-variant experiment (control vs treatment) with planted per-segment effects and analytical ground truth. We will plant a clear “responders” segment (40% of the population, treatment lifts conversion by +10pp and shifts log-AOV by +0.15) and a “non-responders” segment (60% of the population, treatment effect is zero on both channels).

from pytyche import generate

SEGMENTS = {
    "responders": {
        "pct": 0.4,
        "base_conv": 0.08,
        "treatment_effect": 0.10,
        "aov_mu": 3.5,
        "aov_sigma": 0.5,
        "treatment_aov_mu_shift": 0.15,
    },
    "non_responders": {
        "pct": 0.6,
        "base_conv": 0.06,
        "treatment_effect": 0.0,
        "aov_mu": 3.3,
        "aov_sigma": 0.5,
        "treatment_aov_mu_shift": 0.0,
    },
}

bundle = generate(
    n_visitors=800,
    segments=SEGMENTS,
    metric="revenue_per_visitor",
    seed=0,
)

bundle is a CalibrationBundle — a typed pair of observed data and ground truth. The runner of a calibration sweep would unpack the bundle and pass only bundle.observed to the analyzer; for this tutorial we have both halves so we can compare the fit against the planted truth.
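To get a feel for what an analytical ground truth looks like for this configuration, you can work out the planted per-segment RPV effect by hand. The sketch below rests on two assumptions that are readings of the config, not confirmed generator internals: aov_mu and aov_sigma are the log-scale mean and standard deviation of a lognormal spend distribution, and treatment_effect adds directly to the conversion probability. planted_rpv_effect is a hypothetical helper, and bundle.truth remains the authoritative source.

```python
import math

def planted_rpv_effect(seg: dict) -> float:
    """Treatment-minus-control expected revenue-per-visitor for one segment.

    Assumes lognormal spend, so E[spend] = exp(mu + sigma^2 / 2).
    """
    half_var = seg["aov_sigma"] ** 2 / 2
    rpv_control = seg["base_conv"] * math.exp(seg["aov_mu"] + half_var)
    rpv_treated = (seg["base_conv"] + seg["treatment_effect"]) * math.exp(
        seg["aov_mu"] + seg["treatment_aov_mu_shift"] + half_var
    )
    return rpv_treated - rpv_control

# Same values as the SEGMENTS dict above, repeated so the sketch stands alone.
segments = {
    "responders": {"pct": 0.4, "base_conv": 0.08, "treatment_effect": 0.10,
                   "aov_mu": 3.5, "aov_sigma": 0.5, "treatment_aov_mu_shift": 0.15},
    "non_responders": {"pct": 0.6, "base_conv": 0.06, "treatment_effect": 0.0,
                       "aov_mu": 3.3, "aov_sigma": 0.5, "treatment_aov_mu_shift": 0.0},
}

effects = {name: planted_rpv_effect(seg) for name, seg in segments.items()}
# Population-level effect: segment effects weighted by population share.
overall = sum(segments[name]["pct"] * eff for name, eff in effects.items())

for name, eff in effects.items():
    print(f"{name:<16} planted RPV effect = {eff:.3f}")
print(f"{'overall':<16} planted RPV effect = {overall:.3f}")
```

Under these assumptions the responders effect comes out near 4.85 per visitor and the non-responders effect is exactly zero. If the generator parameterizes spend differently, the numbers will differ.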

The data layout the model wants is four NumPy arrays: a covariate matrix X, a treatment indicator Z, the outcome Y_rev (zero for non-converters, revenue for converters), and propensity scores. For this tutorial the only feature we use is the segment indicator, encoded as a single integer column.

import numpy as np
import pandas as pd

control_df = bundle.observed.variants[0].visitors
treatment_df = bundle.observed.variants[1].visitors
visitors = pd.concat([control_df, treatment_df], ignore_index=True)

seg_to_idx = {name: i for i, name in enumerate(SEGMENTS)}
X = visitors["segment"].map(seg_to_idx).to_numpy().reshape(-1, 1).astype(np.float32)
Z = (visitors["variant"] == "treatment").to_numpy().astype(np.float32)
Y_rev = visitors["revenue"].to_numpy().astype(np.float32)
propensity = np.full(len(visitors), 0.5, dtype=np.float32)
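Before fitting, it can help to sanity-check the four arrays. The helper below is a hypothetical convenience, not part of pytyche. Its checks follow the layout described above (one row per visitor, a binary Z, non-negative Y_rev), plus the usual assumption that propensities lie strictly between 0 and 1. It is demonstrated on small stand-in arrays.

```python
import numpy as np

def check_hurdle_inputs(X, Z, Y_rev, propensity):
    """Raise ValueError if the arrays violate the expected layout."""
    if X.ndim != 2:
        raise ValueError("X must be 2-D: (n_visitors, n_features)")
    n = X.shape[0]
    for name, arr in (("Z", Z), ("Y_rev", Y_rev), ("propensity", propensity)):
        if arr.shape != (n,):
            raise ValueError(f"{name} must have shape ({n},), got {arr.shape}")
    if not np.isin(Z, (0.0, 1.0)).all():
        raise ValueError("Z must contain only 0 and 1")
    if not (np.isfinite(Y_rev).all() and (Y_rev >= 0).all()):
        raise ValueError("Y_rev must be finite and non-negative")
    if not ((propensity > 0) & (propensity < 1)).all():
        raise ValueError("propensity must lie strictly between 0 and 1")

# Stand-in arrays with the same layout as X, Z, Y_rev, propensity above.
X_demo = np.array([[0.0], [1.0], [1.0], [0.0]], dtype=np.float32)
Z_demo = np.array([0, 1, 0, 1], dtype=np.float32)
Y_demo = np.array([0.0, 42.5, 0.0, 0.0], dtype=np.float32)
prop_demo = np.full(4, 0.5, dtype=np.float32)

check_hurdle_inputs(X_demo, Z_demo, Y_demo, prop_demo)  # passes silently
```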

The dataset is small by design: roughly 800 visitors, of which only about 60-80 are converters. Hurdle BCF was built for exactly this shape — most rows zero, a handful of rows carrying the lift signal.
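The converter count follows from the planted conversion rates. The arithmetic below assumes an even control/treatment split (consistent with the constant 0.5 propensity) and that treatment_effect adds to the conversion probability; it is a back-of-envelope expectation, not output from the generator.

```python
n_visitors = 800
p_treated = 0.5  # matches the constant propensity used above

# (population share, control conversion rate, treated conversion rate)
segments = {
    "responders": (0.4, 0.08, 0.08 + 0.10),
    "non_responders": (0.6, 0.06, 0.06 + 0.0),
}

# Expected converters = visitors in segment x arm-weighted conversion rate.
expected_converters = sum(
    n_visitors * share * ((1 - p_treated) * conv_c + p_treated * conv_t)
    for share, conv_c, conv_t in segments.values()
)

print(f"expected converters: {expected_converters:.1f} of {n_visitors}")
```

The expectation is about 70 converters, so any single seed landing somewhere in the 60-80 range is unsurprising.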

Fit the joint hurdle BCF

fit_hurdle_bcf is the canonical entry point. It runs a joint shared-tree hurdle model: two forests (a prognostic mu forest and a treatment-effect tau forest), each with shared tree structure but separate leaf values for the conversion channel and the severity channel. Grow / prune proposals are accepted using the joint hurdle log-marginal likelihood.

For the tutorial we deliberately set the sizing knobs small. These are not production values — they are tuned to finish on JAX-CPU in well under two minutes so the doc test suite can run the tutorial on every PR. The “GPU recommended for larger problems” callout above applies to production-sized problems, not this example.

from pytyche import GPUBCFConfig, fit_hurdle_bcf

config = GPUBCFConfig(
    num_burnin=40,
    num_mcmc=80,
    num_trees_mu=30,
    num_trees_tau=15,
    max_depth=4,
    num_gfr_sweeps=2,
    diagnostic_interval=20,
    random_seed=0,
)

result = fit_hurdle_bcf(X, Z, Y_rev, propensity, config)

result is a HurdleBCFResult. The field that matters for posterior interpretation is rpv_cate_samples: an (n, S) array of posterior draws of the per-visitor revenue-per-visitor CATE, where S = (num_mcmc / thin_factor) × num_chains. Here S = 80: the config above does not set thin_factor or num_chains, and at their defaults S equals num_mcmc.
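One way to picture where draws like these come from is to compose the two hurdle channels draw by draw. The sketch below illustrates the general recipe implied by the hurdle product; it is not a description of pytyche internals. The arrays p0, p1, m0 and m1 are hypothetical per-channel posterior draws, filled here with random stand-in values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 5, 80  # visitors, posterior draws

# Stand-in draws of conversion probability under control (p0) and treatment (p1).
p0 = rng.beta(8, 92, size=(n, S))
p1 = rng.beta(18, 82, size=(n, S))
# Stand-in draws of expected spend given conversion under each arm.
m0 = rng.lognormal(3.6, 0.05, size=(n, S))
m1 = rng.lognormal(3.75, 0.05, size=(n, S))

# RPV under each arm is probability x expected spend; the CATE is the difference.
rpv_cate = p1 * m1 - p0 * m0  # (n, S), same layout as rpv_cate_samples

print(rpv_cate.shape)
```

Composing within each draw, instead of multiplying posterior means, keeps the uncertainty of both channels in the resulting RPV draws.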

Read the posterior

The first thing to look at is the overall posterior — does the model think there is, on average, a lift? Marginalizing over visitors and draws gives the population-level effect.

overall_posterior_per_draw = result.rpv_cate_samples.mean(axis=0)  # (S,)
overall_mean = overall_posterior_per_draw.mean()
overall_lo = np.quantile(overall_posterior_per_draw, 0.025)
overall_hi = np.quantile(overall_posterior_per_draw, 0.975)

print(f"Overall posterior RPV lift: mean={overall_mean:.3f} "
      f"(95% CI: {overall_lo:.3f} - {overall_hi:.3f}) | "
      f"true={bundle.truth.effect:.3f}")

The 95% credible interval here is over draws, not over visitors. It is the posterior uncertainty about the population-level lift, not about any individual visitor.
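The distinction between "over draws" and "over visitors" comes down to which axis you reduce first. The sketch below uses a synthetic (n, S) array, not a pytyche result, in which visitors differ a lot and draws differ only a little, so the two summaries are visibly different.

```python
import numpy as np

rng = np.random.default_rng(1)
n, S = 400, 80

# Stand-in (n, S) CATE draws: large visitor-to-visitor variation,
# small draw-to-draw variation. Values are illustrative only.
visitor_effect = rng.normal(2.0, 3.0, size=(n, 1))
draw_noise = rng.normal(0.0, 0.3, size=(1, S))
samples = visitor_effect + draw_noise

# Posterior uncertainty about the population lift:
# average over visitors within each draw, then take quantiles over draws.
per_draw_mean = samples.mean(axis=0)                          # (S,)
ci_over_draws = np.quantile(per_draw_mean, [0.025, 0.975])

# A different question: how much per-visitor posterior means vary across visitors.
per_visitor_mean = samples.mean(axis=1)                       # (n,)
spread_over_visitors = np.quantile(per_visitor_mean, [0.025, 0.975])

print("95% CI over draws:      ", np.round(ci_over_draws, 2))
print("95% range over visitors:", np.round(spread_over_visitors, 2))
```

The first interval answers "how sure are we about the average lift?"; the second describes heterogeneity across visitors and is not a credible interval for the population effect.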

The point of an adaptive-enrichment design, though, is not the average; it is the segment-level GATE. Pytyche is built around segment-level inference (see overview §“Segment-level GATE focus”); the posterior interpretation step that matters is per-segment.

segment_array = visitors["segment"].to_numpy()
cate_per_draw = result.rpv_cate_samples  # (n, S)

print(f"\n{'segment':<16} {'post mean':>10} {'2.5%':>10} {'97.5%':>10} {'truth':>10}")
print("-" * 60)
for seg_name in SEGMENTS:
    mask = segment_array == seg_name
    # Per-draw segment GATE: average CATE over visitors in the segment,
    # then summarize across draws.
    seg_gates = cate_per_draw[mask].mean(axis=0)  # (S,)
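    # The planted effect is constant within a segment, so the first visitor's
    # true CATE stands in for the segment-level truth.
    true_seg_cate = float(bundle.truth.cate_per_visitor.values[mask][0])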
    print(f"{seg_name:<16} {seg_gates.mean():>10.3f} "
          f"{np.quantile(seg_gates, 0.025):>10.3f} "
          f"{np.quantile(seg_gates, 0.975):>10.3f} "
          f"{true_seg_cate:>10.3f}")

assert np.all(np.isfinite(cate_per_draw)), "posterior must be finite"
assert cate_per_draw.shape == (len(visitors), config.num_mcmc)

At this small sample size and small MCMC budget, the posterior is wide, but the structural answer is right: the responders segment posterior mass sits well above zero, and the non_responders 95% credible interval contains zero. That separation is the deliverable — the model discriminates between segments where treatment helps and segments where it doesn’t.
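The "sits above zero" versus "contains zero" reading can be made explicit with a small summary helper. summarize_gate is a hypothetical function, not a pytyche API, and the draws below are random stand-ins for the per-segment seg_gates vectors, not real output.

```python
import numpy as np

def summarize_gate(seg_gates, level=0.95):
    """Summarize a 1-D vector of per-draw segment GATEs."""
    seg_gates = np.asarray(seg_gates, dtype=float)
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(seg_gates, [alpha, 1.0 - alpha])
    return {
        "mean": float(seg_gates.mean()),
        "lo": float(lo),
        "hi": float(hi),
        # True when the whole credible interval lies on one side of zero.
        "excludes_zero": bool(lo > 0 or hi < 0),
        # Share of posterior draws with a positive effect.
        "prob_positive": float((seg_gates > 0).mean()),
    }

rng = np.random.default_rng(2)
stand_in_draws = {
    "responders": rng.normal(4.5, 1.5, size=80),      # illustrative values only
    "non_responders": rng.normal(0.1, 1.5, size=80),
}

for name, draws in stand_in_draws.items():
    print(name, summarize_gate(draws))
```

The posterior probability of a positive effect is often a more useful decision input than a binary "interval excludes zero" flag, especially with only 80 draws, where the tail quantiles are themselves noisy.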

What you should not read off the output is a precise point estimate for each visitor’s individual CATE. Per-row CATEs from the joint hurdle BCF are noisy at this sample size; the library is calibrated to deliver honest segment-level GATEs, not individual-level point predictions. The overview doc explains why this is the right level of inferential ambition.
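A toy model shows why segment averages are steadier than per-row estimates. In the sketch below, each visitor's draws are a shared segment-level component plus large idiosyncratic noise. This structure is an assumption chosen for illustration, not a claim about how the joint hurdle BCF posterior is actually structured.

```python
import numpy as np

rng = np.random.default_rng(4)
n, S = 300, 80

# Toy posterior: a shared per-draw segment component plus large
# per-visitor noise. Values are illustrative only.
shared = rng.normal(4.0, 0.5, size=(1, S))
idiosyncratic = rng.normal(0.0, 5.0, size=(n, S))
samples = shared + idiosyncratic                      # (n, S)

def width95(draws, axis=-1):
    """Width of the central 95% interval along the given axis."""
    lo, hi = np.quantile(draws, [0.025, 0.975], axis=axis)
    return hi - lo

per_visitor_width = width95(samples, axis=1).mean()   # typical single-visitor interval
gate_width = width95(samples.mean(axis=0))            # segment-level interval

print(f"mean per-visitor 95% width: {per_visitor_width:.2f}")
print(f"segment GATE 95% width:     {gate_width:.2f}")
```

Averaging over the segment cancels most of the per-visitor noise, which is why the segment-level interval is much narrower in this toy setup.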

Scaling considerations

The sizing knobs above (num_burnin=40, num_mcmc=80, num_trees_mu=30, num_trees_tau=15) are tutorial-grade. Production adaptive-enrichment fits use larger sample sizes per round (thousands to tens of thousands of visitors), more trees (typical defaults are num_trees_mu=200, num_trees_tau=50), longer MCMC chains (num_burnin=200, num_mcmc=200 or more), and multiple chains for between-chain convergence diagnostics. Those settings are GPU territory — the JAX kernels in pytyche.bcf are written for CUDA and become impractical on CPU as the problem grows. For benchmark-grade timing claims, see the artifacts in bench/ rather than trusting prose.
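For a sense of what a between-chain convergence diagnostic measures, here is the textbook Gelman-Rubin potential scale reduction factor in plain NumPy. This is a minimal sketch of the classic (non-split) formula applied to synthetic chains. gelman_rubin_rhat is a hypothetical helper, and pytyche's own diagnostics may be computed differently.

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classic R-hat for one scalar quantity.

    chains: array of shape (num_chains, num_draws).
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    if m < 2 or n < 2:
        raise ValueError("need at least 2 chains and 2 draws per chain")
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)     # B: scaled variance of chain means
    var_hat = (n - 1) / n * within + between / n      # pooled variance estimate
    return float(np.sqrt(var_hat / within))

rng = np.random.default_rng(3)
mixed = rng.normal(0.0, 1.0, size=(4, 200))           # four chains, same target
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain offset by 3

print(f"R-hat, well-mixed chains: {gelman_rubin_rhat(mixed):.3f}")
print(f"R-hat, one stuck chain:   {gelman_rubin_rhat(stuck):.3f}")
```

Values close to 1 indicate that the chains agree; a value well above 1 means at least one chain is exploring a different region. A single chain cannot reveal that, which is why production fits run several.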

Where to go next

  • Your first adaptive experiment — run the full multi-round design with Thompson allocation and controls retention; the product’s main loop, end to end.

  • Working with the posterior — the reference companion for everything a fitted posterior can do: segmentation, allocation, decision support.

  • BCF posterior calibration at scale — why raw credible intervals are too narrow at large n, and how recalibration corrects them. The hands-on calibration walkthrough is not yet written; this concept page is the current reference.

  • Intended use — if you haven’t read the overview yet, this is the right time.