Running your first adaptive, sequential experiment

You’re a growth PM evaluating three checkout-promotion treatments. You don’t know which works, or for whom, and a standard A/B test won’t tell you: it reports one average lift, hiding exactly the per-customer structure you’d need to target. This tutorial runs an adaptive sequential experiment instead: each round refits a joint hurdle BCF (Bayesian causal forest) on the cumulative data, refines a decision tree mapping visitor features to recommended treatments, and routes the next round’s traffic through that tree’s segments. One call, pt.sequential_experiment(...), drives the whole loop.

The experiment is realistically sized: 350,000 visitors across three rounds, end-to-end in about fifteen minutes on a single consumer GPU. That is the point of the library — production-scale Bayesian causal inference on hardware you already have.

This is the loop you’re about to run — per round, the cell allocation that served traffic and the policy tree it shipped:

[Animation: round-by-round evolution of cell allocations and the policy tree]

Setup

import pytyche as pt

report = pt.check_setup()
pytyche 0.1.0.dev0
bartz   0.9.0
JAX devices: cuda:0
CUDA available: True
calibration artifacts: (none bundled)

The fits in this tutorial expect the CUDA device shown above; on CPU they still run, but slowly, and warn once. The “calibration artifacts: (none bundled)” line is why this tutorial runs uncalibrated.

The scenario

Three candidate checkout-promotion treatments competing for a fixed visitor budget:

  • control — no promotion

  • low_promo — small storefront promo

  • free_ship — free shipping above $50

The experiment runs three rounds covering 350,000 visitors in total, on a doubling-batch schedule (50,000 → 100,000 → 200,000). The data is generated by pytyche’s clustered_realistic template, a four-cluster e-commerce mixture where the best treatment differs by customer cluster.

treatments = ["control", "low_promo", "free_ship"]

generator = pt.simulated_experiment_generator(
    template="clustered_realistic",
    metric="revenue_per_visitor",
    effect_scale=0.1,
    K=3,
    seed=42,
    treatment_names=treatments,
)

schedule = pt.GeometricSchedule(initial=50_000, growth=2.0, n_rounds=3)

pt.simulated_experiment_generator(...) wraps one of pytyche’s data-generating templates as a generator — the callable a sequential experiment pulls each round’s data from. K is the treatment count (control + K−1 active treatments), and treatment_names= maps the template’s arms onto this experiment’s naming. The modest effect_scale plants realistic per-segment effects: large enough to find, small enough that finding them takes data. A real deployment passes its own generator — any callable that pulls a round’s data from your experimentation platform — and everything below stays the same.
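The doubling schedule's arithmetic is easy to check by hand. The sketch below is plain Python, not pytyche's GeometricSchedule; the helper name geometric_batches is hypothetical, and it assumes each round's batch is the previous round's batch multiplied by growth.

```python
def geometric_batches(initial: int, growth: float, n_rounds: int) -> list[int]:
    """Hypothetical helper: visitors per round when each batch grows by `growth`."""
    return [round(initial * growth**r) for r in range(n_rounds)]


# Same numbers as the schedule constructed above.
batches = geometric_batches(initial=50_000, growth=2.0, n_rounds=3)
total = sum(batches)

print(batches)  # [50000, 100000, 200000]
print(total)    # 350000
```

Under that assumption the three rounds add up to the 350,000 visitors quoted for this tutorial.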

Constructing the sequential experiment

pt.sequential_experiment(...) returns a stateful object you iterate round by round. Configuration is fixed at construction. Per-round overrides (custom cells, intervention) happen via the iteration loop.

exp = pt.sequential_experiment(
    generator=generator,
    schedule=schedule,
    treatments=treatments,
    min_control_weight=0.05,
    min_explore_weight=0.05,
    max_segment_depth=3,
    seed=42,
)

Parameters worth understanding before you ship:

  • generator is any callable that produces a round’s observed data (and, in sim mode, the corresponding ground truth) when the experiment advances. Real-data runs pass a callable that pulls the round’s data from the experimentation platform.

  • min_control_weight=0.05 and min_explore_weight=0.05 are the controls-retention floors. The Control cell never falls below 5% of round traffic. The Explore cell (uniform-random across treatments) also never falls below 5%. Together they guarantee that baseline measurement and every-treatment-observed sampling continue at every round, regardless of how confident the model becomes.

  • calibration defaults to None: posteriors are used uncorrected, and every round’s results carry an explicit uncalibrated label. Uncalibrated posteriors are typically overconfident at scale, which also makes allocation concentrate on leaders faster than honest uncertainty would. Passing a calibration artifact instead applies an SBC-fitted coverage correction so posterior intervals stay honest — see calibration at scale for what that corrects and when it matters.

  • progress=True (not set here) renders live progress bars over each round’s fit — useful in an interactive notebook while a multi-minute fit runs.
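To make the two floors concrete, here is a minimal sketch of how they bound a round's traffic split. It is an illustration only: the function cell_weights is hypothetical, and it assumes the Optimized cell simply receives whatever the two floors leave over, which matches the 0.05 / 0.05 / 0.90 plan this run produces for round 2.

```python
def cell_weights(min_control: float, min_explore: float) -> dict[str, float]:
    """Hypothetical helper: floors first, remainder to the Optimized cell."""
    optimized = 1.0 - min_control - min_explore
    if optimized < 0:
        raise ValueError("floors exceed 100% of round traffic")
    return {"control": min_control, "explore": min_explore, "optimized": optimized}


weights = cell_weights(min_control=0.05, min_explore=0.05)

# Expected visitors per cell for a 100,000-visitor round.
expected = {cell: round(w * 100_000) for cell, w in weights.items()}
print(expected)  # {'control': 5000, 'explore': 5000, 'optimized': 90000}
```

Realized cell sizes vary around these expectations because visitors are assigned at random.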

Round 1: cold start

The default round-1 cell structure is a Control cell and an Explore cell at 50/50 share. Half the visitors get the baseline. The other half get a uniform-random treatment, giving the model clean HTE signal to learn from.
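The cold-start split can be mimicked in a few lines. This is a sketch of the assignment logic as described above, not pytyche's implementation: the function name and the per-visitor coin flip are assumptions. Note that the Explore cell draws uniformly over all three treatments, control included.

```python
import random
from collections import Counter

TREATMENTS = ["control", "low_promo", "free_ship"]


def assign_cold_start(n_visitors: int, seed: int = 0) -> list[tuple[str, str]]:
    """Hypothetical helper: (cell, treatment) per visitor under a 50/50 split."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(n_visitors):
        if rng.random() < 0.5:
            # Control cell: always the baseline.
            assignments.append(("control", "control"))
        else:
            # Explore cell: uniform-random over every treatment.
            assignments.append(("explore", rng.choice(TREATMENTS)))
    return assignments


assignments = assign_cold_start(50_000)
cell_counts = Counter(cell for cell, _ in assignments)
explore_arms = Counter(arm for cell, arm in assignments if cell == "explore")
print(cell_counts)
print(explore_arms)
```

Cell sizes land near 25,000 each rather than exactly on it, which is why a real round reports counts such as 25,002 and 24,998.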

r1 = next(exp)

r1
/tmp/ipykernel_298/2702889220.py:1: UncalibratedWarning: Running uncalibrated BCF posteriors (calibration=None); interval coverage may be miscalibrated at scale. Supply an SBC-fitted artifact via sequential_experiment(calibration=...) to correct it.
  r1 = next(exp)
round 0: revenue_per_visitor | 5 segment(s) | SHIP 'low_promo' (P(lift>0)=0.96)
cell       n        RPV (model 80% CI)       lift vs control             policy
control    25,002   2.4060 [2.4514, 2.6565]                              baseline: always control
explore    24,998   2.5586 [2.5054, 2.6461]  +0.1526 [-0.0556, +0.1293]  uniform over ['control', 'low_promo', 'free_ship']

truth: cate_rmse=0.9884 policy_accuracy=56.7% oracle_gap=0.1089/visitor

next: 3 cell(s), 100,000 visitors

After round 1 (printed as “round 0” because the output counts rounds from zero) the posterior is still finding its footing: there is preliminary signal (the clustered_realistic DGP plants real heterogeneity) but the per-segment picture is not yet settled.

You can inspect the round’s discovered segments directly. Each segment’s gate_estimate is the posterior mean of its segment-level average treatment effect (the GATE), with gate_ci the 80% credible interval (the contracts.py convention, tighter than 95% to bias toward action). stability_score is a bootstrap-replicability score on the segment boundary; segments with stability_score >= 0.80 are considered credible enough to act on.

r1.analysis
AnalysisResult — clustered_realistic-es0.1-seed42-K3 · revenue_per_visitor
comparison            lift (80% CI)               P(lift > 0)
low_promo vs control  +0.2517 [+0.0671, +0.4175]  0.96
free_ship vs control  -0.1338 [-0.3137, +0.0336]  0.15

segment                                                        share  GATE (80% CI)               stability  leader
browse_depth <= 4.66207 AND z0 <= -0.284063 AND z1 <= 1.08033  24%    +0.7324 [+0.4363, +1.0170]  1.00       low_promo P=0.78
browse_depth > 4.66207 AND z0 <= -0.284063 AND z1 <= 1.08033   10%    +1.2363 [+0.6899, +1.8175]  1.00       low_promo P=0.98
browse_depth <= 2.20055 AND z0 > -0.284063 AND z1 <= 1.08033   17%    -0.3321 [-0.6974, -0.0356]  1.00       control P=0.91
browse_depth > 2.20055 AND z0 > -0.284063 AND z1 <= 1.08033    35%    +0.4062 [+0.0482, +0.7227]  1.00       low_promo P=0.93
z1 > 1.08033                                                   14%    -0.7155 [-1.1026, -0.3230]  1.00       control P=0.97

recommendation: SHIP — low_promo
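For intuition about what an estimate with an 80% credible interval summarizes, the sketch below computes both from a set of posterior draws. The draws are synthetic (a made-up normal distribution), and the equal-tailed construction is an assumption; the library may build gate_ci differently.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic posterior draws of one segment's GATE (made-up mean and spread).
draws = [rng.gauss(0.7, 0.2) for _ in range(4_000)]

gate_estimate = statistics.fmean(draws)       # posterior mean
deciles = statistics.quantiles(draws, n=10)   # 9 cut points: 10%, 20%, ..., 90%
gate_ci = (deciles[0], deciles[-1])           # equal-tailed 80% interval

# Fraction of draws that fall inside the interval.
inside = sum(gate_ci[0] <= d <= gate_ci[1] for d in draws) / len(draws)
print(f"{gate_estimate:+.4f} [{gate_ci[0]:+.4f}, {gate_ci[1]:+.4f}] covers {inside:.0%}")
```

An 80% interval is narrower than a 95% interval from the same draws, which is the sense in which the convention biases toward action.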

And the recommendation for the next round:

plan = r1.next_recommendation

plan
NextRoundPlan — 100,000 visitors, 3 cell(s)
cell       weight  policy
control    0.05    baseline: always control
explore    0.05    uniform over ['control', 'low_promo', 'free_ship']
optimized  0.90    policy tree routing over 5 segments with per-leaf Thompson allocation

next_recommendation carries the recommended cell structure for the next round. The operator may accept as-is, partially override (for example, add a hypothesis cell alongside the recommended Optimized cell), or fully replace. This tutorial accepts as-is throughout.

ax = pt.viz.plot_cells(plan.cells)
[Figure: cell weights in the round-2 plan, drawn by pt.viz.plot_cells]

The Optimized cell takes 90% of round-2 traffic; the Control and Explore floors keep their guaranteed 5% shares no matter how confident the model becomes.

Round 2: narrowing toward responders

r2 = next(exp)

r2
round 1: revenue_per_visitor | 4 segment(s) | SHIP 'low_promo' (P(lift>0)=1.00)
cell       n        RPV (model 80% CI)       lift vs control             policy
control    4,962    2.4665 [2.3952, 2.5358]                              baseline: always control
explore    4,993    2.6957 [2.4824, 2.6149]  +0.2292 [+0.0059, +0.1399]  uniform over ['control', 'low_promo', 'free_ship']
optimized  90,045   2.6529 [2.6483, 2.7351]  +0.1864 [+0.1476, +0.2967]  policy tree routing over 5 segments with per-leaf Thompson allocation

truth: cate_rmse=0.3491 policy_accuracy=61.8% oracle_gap=0.0982/visitor

next: 3 cell(s), 200,000 visitors

The cumulative posterior now incorporates 150,000 visitors. Round 1’s recommendation introduced an Optimized cell that routes visitors per the policy tree fitted on round 1’s CATEs, and the per-cell scoreboard above shows whether that targeted routing produced lift over Control, which is the headline number for the round. The HTE discovery itself is independent of cells: it is a property of the joint posterior over all the data, regardless of how that data was routed.
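The observed lift figures in the round-2 table above are consistent with a simple difference: each cell's observed RPV minus the Control cell's observed RPV. The check below uses only numbers copied from that table; that this is how the column is derived is an inference from the output, not a documented definition.

```python
# Observed RPV per cell, copied from the round-2 table above.
rpv = {"control": 2.4665, "explore": 2.6957, "optimized": 2.6529}

# Lift vs control as a plain difference, rounded to the table's 4 decimals.
lift = {
    cell: round(value - rpv["control"], 4)
    for cell, value in rpv.items()
    if cell != "control"
}
print(lift)  # {'explore': 0.2292, 'optimized': 0.1864}
```

The bracketed interval next to each lift is the model's 80% credible interval, so an observed difference can sit outside it, as the Explore row does here.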

r2.analysis
AnalysisResult — clustered_realistic-es0.1-seed42-K3 · revenue_per_visitor
comparison            lift (80% CI)               P(lift > 0)
low_promo vs control  +0.2680 [+0.1577, +0.3798]  1.00
free_ship vs control  -0.0589 [-0.1926, +0.0663]  0.29

segment                                                                                 share  GATE (80% CI)               stability  leader
browse_depth <= 1.13612                                                                 17%    -0.1177 [-0.3181, +0.0653]  1.00       control P=0.77
browse_depth > 1.13612 AND channel in {direct, email, organic, paid} AND z1 <= 1.08344  63%    +0.4005 [+0.2860, +0.5126]  1.00       low_promo P=0.99
browse_depth > 1.13612 AND channel in {direct, email, organic, paid} AND z1 > 1.08344   10%    +0.2535 [-0.2796, +0.4920]  1.00       low_promo P=0.78
browse_depth > 1.13612 AND channel == social                                            10%    +0.0393 [-0.3715, +0.5043]  0.94       low_promo P=0.43

recommendation: SHIP — low_promo

ax = pt.viz.plot_segment_intervals(r2.analysis.segments)
[Figure: per-segment GATE intervals for round 2, drawn by pt.viz.plot_segment_intervals]

Expect narrower credible intervals than in round 1 and firmer per-segment leaders. arm_best_probabilities (rendered per segment above) is the per-segment posterior probability that each arm is best: a leaf where the leader sits above 0.90 is one the engine is confident in, while a leader at 0.45 against a 0.40 runner-up is still exploring. The experiment exposes predicate accessors for the common questions:

{
    "credible segments yet": exp.has_credible_segments(),
    "graduation candidate yet": exp.has_graduation_candidate(),
}
{'credible segments yet': True, 'graduation candidate yet': True}

has_credible_segments() answers “is any segment credible”; has_graduation_candidate() answers “is any (treatment, segment) pair ready to ship.” With effects this clear, the first graduation candidate often appears as early as round 2 — the floors keep the experiment honest either way. The actual candidate list comes from exp.graduation_candidates(...) (round 3, below).
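The idea behind a per-segment "probability that each arm is best" can be sketched from posterior draws: count how often each arm comes out on top. Everything below is illustrative. The draws are synthetic, the means and spreads are made up, and the variable names are not pytyche attributes. Thompson-style allocation, in its textbook form, sends traffic to arms in proportion to these probabilities; the library's per-leaf scheme may differ in detail.

```python
import random

rng = random.Random(0)
n_draws = 5_000

# Made-up posterior (mean, sd) of one segment's revenue per visitor, per arm.
posterior = {
    "control": (2.45, 0.05),
    "low_promo": (2.60, 0.05),
    "free_ship": (2.50, 0.05),
}

# For each joint draw, record which arm has the highest sampled value.
wins = {arm: 0 for arm in posterior}
for _ in range(n_draws):
    draw = {arm: rng.gauss(mu, sd) for arm, (mu, sd) in posterior.items()}
    wins[max(draw, key=draw.get)] += 1

arm_best = {arm: count / n_draws for arm, count in wins.items()}
leader = max(arm_best, key=arm_best.get)
print(leader, arm_best)
```

A leader probability well above 0.90 corresponds to a leaf the engine is confident in; probabilities spread evenly across arms mean the leaf is still exploring.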

Round 3: mature segmentation

r3 = next(exp)

r3
round 2: revenue_per_visitor | 5 segment(s) | SHIP 'low_promo' (P(lift>0)=1.00)
cell       n        RPV (model 80% CI)       lift vs control             policy
control    10,043   2.2150 [2.4135, 2.5027]                              baseline: always control
explore    10,050   2.5818 [2.4624, 2.5310]  +0.3668 [+0.0029, +0.0873]  uniform over ['control', 'low_promo', 'free_ship']
optimized  179,907  2.7793 [2.7350, 2.7986]  +0.5643 [+0.2657, +0.3563]  policy tree routing over 4 segments with per-leaf Thompson allocation

truth: cate_rmse=0.2989 policy_accuracy=67.2% oracle_gap=0.0677/visitor

next: 3 cell(s), no next round (schedule exhausted)

After three rounds and 350,000 cumulative visitors, the policy tree’s segmentation has matured. The default graduation rule (ExpectedLossRule, evaluated over each round’s recommendation_summary decision evidence) fires for a (treatment, segment) pair when all three conditions hold across consecutive rounds: expected loss below tolerance, probability of outperforming control above 0.95, and probability of meaningful improvement above 0.80.

candidates = exp.graduation_candidates(sustained_rounds=2)

candidates[0]
GraduationCandidate — 'low_promo' @ browse_depth > 1.07424 AND channel in {direct, email, social}, sustained 3 round(s)
  expected loss if shipped: 0.0000/visitor   P(lift > 0) = 1.00
  value of one more round: 0.0000/visitor

A run at this scale typically graduates one to three (treatment, segment) pairs by round 3; this run surfaces four, all listed in the summary below. The cell above shows the first; candidates holds them all. Each candidate is structured data. The operator (or an automated workflow) decides whether to promote one to broader rollout. The library does not auto-ship.
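The three-condition, sustained-rounds logic described above can be sketched as follows. This is not ExpectedLossRule itself: the RoundEvidence class, the function names, and the loss tolerance of 0.01 are hypothetical. Only the 0.95 and 0.80 probability thresholds come from the text, and the evidence values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundEvidence:
    """Hypothetical per-round decision evidence for one (treatment, segment) pair."""
    expected_loss: float
    p_beats_control: float
    p_meaningful: float


def passes(e: RoundEvidence, loss_tolerance: float = 0.01) -> bool:
    # All three conditions must hold in the same round.
    return (
        e.expected_loss < loss_tolerance
        and e.p_beats_control > 0.95
        and e.p_meaningful > 0.80
    )


def is_sustained(history: list[RoundEvidence], sustained_rounds: int = 2) -> bool:
    # The most recent `sustained_rounds` rounds must all pass.
    if len(history) < sustained_rounds:
        return False
    return all(passes(e) for e in history[-sustained_rounds:])


# Made-up evidence: round 1 falls short, rounds 2 and 3 clear every threshold.
history = [
    RoundEvidence(expected_loss=0.020, p_beats_control=0.93, p_meaningful=0.70),
    RoundEvidence(expected_loss=0.004, p_beats_control=0.99, p_meaningful=0.88),
    RoundEvidence(expected_loss=0.001, p_beats_control=1.00, p_meaningful=0.95),
]
print(is_sustained(history, sustained_rounds=2))  # True
print(is_sustained(history, sustained_rounds=3))  # False
```

Requiring consecutive passes guards against graduating a pair on one lucky round.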

The candidate’s latest_recommendation carries the full decision evidence, including expected_value_of_one_more_round: the expected per-visitor reduction in regret from running one more round at the same per-round n. A candidate where this value is near zero is one the experiment has effectively converged on — additional data is unlikely to change the decision. A non-zero value means you’re still data-limited; ship only if the time-cost of another round exceeds the expected gain. See decision-theoretic inputs for the formula and the compose-into-your-policy framing.
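The ship-or-continue comparison in the last sentence can be written as a one-line rule. This is a sketch of the reasoning only, not the library's formula: the function name, the visitors_affected horizon, and the cost_of_delay figure are all hypothetical inputs you would supply from your own business context.

```python
def worth_another_round(
    value_per_visitor: float,   # expected per-visitor regret reduction from one more round
    visitors_affected: int,     # hypothetical horizon the decision applies to
    cost_of_delay: float,       # hypothetical cost of waiting one more round
) -> bool:
    """Run another round only if the expected gain exceeds the cost of waiting."""
    expected_gain = value_per_visitor * visitors_affected
    return expected_gain > cost_of_delay


# The candidate above reports 0.0000/visitor, so no gain can outweigh a positive cost.
print(worth_another_round(0.0, visitors_affected=200_000, cost_of_delay=1_500.0))   # False

# A made-up data-limited case: 0.02/visitor over 200,000 visitors outweighs the delay.
print(worth_another_round(0.02, visitors_affected=200_000, cost_of_delay=1_500.0))  # True
```

Both sides must be in the same units (here, revenue) for the comparison to mean anything.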

What to do with the result

print(exp.summary)
3 round(s) completed; confidence is high. Latest round 2: revenue_per_visitor | 5 segment(s) | SHIP 'low_promo' (P(lift>0)=1.00)

We have strong, sustained evidence: at least one discovered segment is stable across bootstrap refits, and a treatment has met the graduation thresholds in consecutive rounds. This round's summary decision is SHIP — the leading treatment clears the decision thresholds on the current evidence.

Next round: no further rounds (the schedule is exhausted), split control 5%, explore 5%, optimized 90%. The optimized cell routes through a policy tree over 5 discovered segment(s) with per-segment Thompson allocation.

Graduation candidates (surfaced for the operator — nothing is auto-graduated): low_promo in segment 4 (browse_depth > 1.07424 AND channel in {direct, email, social}); low_promo in segment 5 (browse_depth > 1.07424 AND channel == paid); low_promo in segment 7 (browse_depth > 1.07424 AND channel == organic AND z4 <= 0.468919); low_promo in segment 8 (browse_depth > 1.07424 AND channel == organic AND z4 > 0.468919).

Graduation candidates: low_promo in segment 4; low_promo in segment 5; low_promo in segment 7; low_promo in segment 8.

exp.confidence
'high'

exp.summary is a multi-paragraph prose summary of the latest round in the context of the experiment’s history. exp.confidence is a one-word label ("high", "medium", or "low") derived from the credible-segment count and graduation candidate state, useful for at-a-glance status in a notebook or dashboard.

For platform handoff, exp.next_recommendation is the next-round cell structure your experimentation platform would consume. The structured Python dataclass is the canonical output. Platform integrations convert it to whatever shape the target configuration store expects.
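As an illustration of that conversion step, the sketch below serializes a plan-shaped dataclass to JSON with the standard library. CellSpec and PlanSpec are hypothetical stand-ins, not pytyche classes, and the real plan object may have different fields; the values mirror the round-2 plan printed earlier.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CellSpec:
    """Hypothetical stand-in for one cell of a next-round plan."""
    name: str
    weight: float
    policy: str


@dataclass(frozen=True)
class PlanSpec:
    """Hypothetical stand-in for the whole next-round plan."""
    n_visitors: int
    cells: tuple[CellSpec, ...]


plan_spec = PlanSpec(
    n_visitors=100_000,
    cells=(
        CellSpec("control", 0.05, "baseline: always control"),
        CellSpec("explore", 0.05, "uniform over all treatments"),
        CellSpec("optimized", 0.90, "policy tree routing"),
    ),
)

# asdict() recurses into nested dataclasses; JSON encodes the tuple as a list.
payload = json.dumps(asdict(plan_spec), indent=2)
print(payload)
```

Keeping the dataclass as the source of truth and generating the platform payload from it means the serialization format can change without touching the experiment code.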

Sim-mode truth comparison

Because this tutorial runs against a simulated generator, each round carries a truth comparison: the planted ground truth against the round’s estimate. Real-data runs (where the generator returns truth=None) skip these.

r3.truth_comparison
TruthComparison
  cate_rmse: 0.2989   policy_accuracy: 67%
  rpv — policy: 2.7891   uniform: 2.5434   oracle: 2.8568   oracle gap: 0.0677

Policy accuracy is the fraction of visitors for whom the recommended treatment matches the truth-optimal treatment. Oracle gap is the RPV regret of the recommended policy versus the oracle policy.
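Both definitions can be worked through on a toy example. The four visitors and their true per-treatment revenues below are invented for illustration; the last lines check the reported round-3 figures, where oracle RPV minus policy RPV reproduces the printed oracle gap.

```python
from statistics import fmean

# Made-up true expected revenue per visitor under each treatment.
truth = [
    {"control": 2.0, "low_promo": 2.5, "free_ship": 2.2},
    {"control": 3.0, "low_promo": 2.8, "free_ship": 2.9},
    {"control": 1.0, "low_promo": 1.4, "free_ship": 1.6},
    {"control": 2.4, "low_promo": 2.6, "free_ship": 2.5},
]
# Treatment the (hypothetical) policy recommends for each visitor.
recommended = ["low_promo", "low_promo", "free_ship", "low_promo"]

# Truth-optimal treatment per visitor.
oracle = [max(row, key=row.get) for row in truth]

policy_accuracy = fmean(rec == best for rec, best in zip(recommended, oracle))
policy_rpv = fmean(row[rec] for row, rec in zip(truth, recommended))
oracle_rpv = fmean(row[best] for row, best in zip(truth, oracle))
oracle_gap = oracle_rpv - policy_rpv

print(policy_accuracy)       # 0.75
print(round(oracle_gap, 4))  # 0.05

# Round-3 figures from the output above: oracle RPV minus policy RPV.
reported_gap = round(2.8568 - 2.7891, 4)
print(reported_gap)          # 0.0677
```

The second visitor shows why the two metrics differ: the policy picks the wrong arm there, but the cost is only the 0.2 revenue difference between the recommended and the best treatment.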

Round-by-round history

The full per-round snapshots are attached to the experiment.

for past in exp.history:
    print(past.summary_one_line())
round 0: revenue_per_visitor | 5 segment(s) | SHIP 'low_promo' (P(lift>0)=0.96)
round 1: revenue_per_visitor | 4 segment(s) | SHIP 'low_promo' (P(lift>0)=1.00)
round 2: revenue_per_visitor | 5 segment(s) | SHIP 'low_promo' (P(lift>0)=1.00)

Each entry carries the round’s posterior, comparisons, summary recommendation, discovered segments, cells shipped, per-cell observations, and recommendation for the next round.

The same evolution visually, one panel per round:

[Figure: Round 1, the cold-start Control/Explore split and the first fitted policy tree]

[Figure: Round 2, the Optimized cell takes 90% of traffic; the tree refines]

[Figure: Round 3, the mature segmentation the experiment ships]

What’s next

This tutorial walked the simplest workflow: accept the recommended tree each round. From here:

  • The “injecting your own treatment hypotheses” tutorial shows how to add your own cells alongside the recommended cell structure when you have a theory you want to test in parallel.

  • Advanced experimental design covers the statistical and budget knobs — graduation thresholds, control/explore floors, segmentation depth, schedule shape — and how to choose them for your domain.

  • The calibration at scale concept doc explains what a calibration artifact corrects and why this tutorial’s uncalibrated run (the calibration=None default) is the right starting point but not the production recommendation.

  • Statistical honesty explains why discovered-segment claims need the stability scores and credible intervals you saw above — and what goes wrong with post-hoc dashboard mining.

  • The glossary defines every term used here.