---
title: Running your first adaptive, sequential experiment
review-state: drafting
last-human-review: "2026-06-05"
depends-on:
  - src/pytyche/experiment
  - src/pytyche/analysis
  - src/pytyche/contracts.py
  - src/pytyche/generators/scenarios.py
owner: tradcliffe
quadrant: tutorial
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Running your first adaptive, sequential experiment

You're a growth PM evaluating three checkout-promotion treatments. You don't know which works, or for whom, and a standard A/B test won't tell you: it reports one average lift, hiding exactly the per-customer structure you'd need to target.

This tutorial runs an adaptive sequential experiment instead: each round refits a {term}`joint hurdle BCF` on the cumulative data, refines a decision tree mapping visitor features to recommended treatments, and routes the next round's traffic through that tree's segments. One call, `pt.sequential_experiment(...)`, drives the whole loop.

The experiment is realistically sized: 350,000 visitors across three rounds, end-to-end in about fifteen minutes on a single consumer GPU. That is the point of the library: production-scale Bayesian causal inference on hardware you already have.

This is the loop you're about to run: per round, the cell allocation that served traffic and the policy tree it shipped.

![Round-by-round evolution of cell allocations and the policy tree, animated](../_static/first-adaptive-experiment-evolution.gif)

## Setup

```{code-cell} ipython3
import pytyche as pt

report = pt.check_setup()
```

The fits in this tutorial expect the CUDA device shown above (on CPU they run, slowly, and warn once). `calibration artifacts: (none bundled)` is why this tutorial [runs uncalibrated](#constructing-the-sequential-experiment).

## The scenario

Three candidate checkout-promotion treatments competing for a fixed visitor budget:

- `control`: no promotion
- `low_promo`: small storefront promo
- `free_ship`: free shipping above $50

Three rounds covering 350,000 visitors total, on a doubling-batch schedule (50,000 → 100,000 → 200,000). The data is generated by pytyche's `clustered_realistic` template, a four-cluster e-commerce mixture where the best treatment differs by customer cluster.

```{code-cell} ipython3
treatments = ["control", "low_promo", "free_ship"]

# Simulated data source: the experiment pulls each round's data from this callable.
generator = pt.simulated_experiment_generator(
    template="clustered_realistic",
    metric="revenue_per_visitor",
    effect_scale=0.1,
    K=3,
    seed=42,
    treatment_names=treatments,
)

# Doubling schedule: 50,000, then 100,000, then 200,000 visitors.
schedule = pt.GeometricSchedule(initial=50_000, growth=2.0, n_rounds=3)
```

`pt.simulated_experiment_generator(...)` wraps one of pytyche's data-generating templates as a *generator*: the callable a sequential experiment pulls each round's data from. `K` is the treatment count (control + K−1 active treatments), and `treatment_names=` maps the template's arms onto this experiment's naming. The modest `effect_scale` plants realistic per-segment effects: large enough to find, small enough that finding them takes data.

A real deployment passes its own generator (any callable that pulls a round's data from your experimentation platform), and everything below stays the same.

## Constructing the sequential experiment

`pt.sequential_experiment(...)` returns a stateful object you iterate round by round. Configuration is fixed at construction. Per-round overrides (custom cells, intervention) happen via the iteration loop.

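
To make "a stateful object you iterate round by round" concrete, here is a toy sketch in plain Python. It is not pytyche's implementation: `geometric_sizes`, `ToyExperiment`, and `fake_generator` are hypothetical names, and the generator's `(round_number, n_visitors)` signature is an assumption made only for this illustration. It shows the doubling schedule from the scenario and an object whose configuration is fixed at construction while its history grows with each `next(...)`.

```python
from typing import Callable, Iterator, List


def geometric_sizes(initial: int, growth: float, n_rounds: int) -> List[int]:
    """Per-round batch sizes for a geometric schedule (hypothetical helper)."""
    return [round(initial * growth**i) for i in range(n_rounds)]


class ToyExperiment:
    """Toy stand-in for a stateful, round-by-round experiment object."""

    def __init__(self, generator: Callable[[int, int], dict], sizes: List[int]):
        self._generator = generator      # fixed at construction
        self._sizes = sizes              # fixed at construction
        self.history: List[dict] = []    # grows by one entry per round

    def __iter__(self) -> Iterator[dict]:
        return self

    def __next__(self) -> dict:
        round_index = len(self.history)
        if round_index >= len(self._sizes):
            raise StopIteration          # schedule exhausted
        batch = self._generator(round_index + 1, self._sizes[round_index])
        self.history.append(batch)
        return batch


def fake_generator(round_number: int, n_visitors: int) -> dict:
    """Hypothetical generator: a real one would return the round's observed data."""
    return {"round": round_number, "n_visitors": n_visitors}


sizes = geometric_sizes(initial=50_000, growth=2.0, n_rounds=3)
toy = ToyExperiment(fake_generator, sizes)

first = next(toy)        # advance one round explicitly
remaining = list(toy)    # or drain the remaining rounds in a loop
total_visitors = sum(entry["n_visitors"] for entry in toy.history)
print(sizes, total_visitors)
```

The three batch sizes sum to the 350,000 visitors quoted in the scenario.
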
```{code-cell} ipython3
exp = pt.sequential_experiment(
    generator=generator,
    schedule=schedule,
    treatments=treatments,
    # Control and Explore each keep at least 5% of every round's traffic.
    min_control_weight=0.05,
    min_explore_weight=0.05,
    max_segment_depth=3,
    seed=42,
)
```

Parameters worth understanding before you ship:

- `generator` is any callable that produces a round's observed data (and, in sim mode, the corresponding ground truth) when the experiment advances. Real-data runs pass a callable that pulls the round's data from the experimentation platform.
- `min_control_weight=0.05` and `min_explore_weight=0.05` are the controls-retention floors. The Control cell never falls below 5% of round traffic. The Explore cell (uniform-random across treatments) also never falls below 5%. Together they guarantee that baseline measurement and every-treatment-observed sampling continue at every round, regardless of how confident the model becomes.
- `calibration` defaults to `None`: posteriors are used uncorrected, and every round's results carry an explicit uncalibrated label. Uncalibrated posteriors are typically overconfident at scale, which also makes allocation concentrate on leaders faster than honest uncertainty would. Passing a calibration artifact instead applies an SBC-fitted coverage correction so posterior intervals stay honest; see [calibration at scale](../concepts/bcf-calibration-at-scale.md) for what that corrects and when it matters.
- `progress=True` (not set here) renders live progress bars over each round's fit, which is useful in an interactive notebook while a multi-minute fit runs.

## Round 1: cold start

The default round-1 cell structure is a Control cell and an Explore cell at 50/50 share. Half the visitors get the baseline. The other half get a uniform-random treatment, giving the model clean heterogeneous-treatment-effect (HTE) signal to learn from.

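
As a rough worked example of what that cold-start split implies, the sketch below computes the expected share of round-1 traffic each treatment receives. It assumes the Explore cell draws uniformly over all three listed treatments, control included; that reading is an assumption for illustration, not something confirmed here about pytyche's internals.

```python
from fractions import Fraction

treatments = ["control", "low_promo", "free_ship"]
n_round_1 = 50_000  # first batch of the doubling schedule

# Round-1 default from the text: Control and Explore cells at a 50/50 share.
cell_shares = {"Control": Fraction(1, 2), "Explore": Fraction(1, 2)}

# Assumption: Explore assigns uniformly over every listed treatment, control included.
explore_split = {name: Fraction(1, len(treatments)) for name in treatments}

# Expected share of all round-1 visitors that ends up on each treatment.
expected_share = {
    name: cell_shares["Explore"] * explore_split[name] for name in treatments
}
expected_share["control"] += cell_shares["Control"]

expected_visitors = {
    name: float(share * n_round_1) for name, share in expected_share.items()
}

for name in treatments:
    print(f"{name}: share={expected_share[name]}, visitors~{expected_visitors[name]:,.0f}")
```

Under that assumption, most round-1 traffic lands on the baseline, and each active treatment is learned from roughly one sixth of the batch.
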
```{code-cell} ipython3
r1 = next(exp)
r1
```

After round 1 the posterior is still finding its footing: there is preliminary signal (the `clustered_realistic` data-generating process plants real heterogeneity), but the per-segment picture is not yet settled.

You can inspect the round's discovered segments directly. Each segment's `gate_estimate` is the posterior mean of its segment-level average treatment effect (the GATE), with `gate_ci` the 80% credible interval (the `contracts.py` convention; narrower than a 95% interval, to bias toward action). `stability_score` is a bootstrap-replicability score on the segment boundary; segments with `stability_score >= 0.80` are considered credible enough to act on.

```{code-cell} ipython3
r1.analysis
```

And the recommendation for the next round:

```{code-cell} ipython3
plan = r1.next_recommendation
plan
```

`next_recommendation` carries the recommended cell structure for the next round. The operator may accept as-is, partially override (for example, add a hypothesis cell alongside the recommended Optimized cell), or fully replace. This tutorial accepts as-is throughout.

```{code-cell} ipython3
ax = pt.viz.plot_cells(plan.cells)
```

The Optimized cell takes 90% of round-2 traffic; the Control and Explore floors keep their guaranteed 5% shares no matter how confident the model becomes.

## Round 2: narrowing toward responders

```{code-cell} ipython3
r2 = next(exp)
r2
```

The cumulative posterior now incorporates 150,000 visitors. Round 1's recommendation introduced an Optimized cell that routes visitors per the policy tree fitted on round 1's conditional average treatment effects (CATEs), and the dashboard's scoreboard shows whether that targeted routing produced lift over Control, which is the headline number for the round. The HTE discovery itself is independent of cells: it is a property of the joint posterior over all the data, regardless of how that data was routed.

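
The routing described above can be pictured with a small plain-Python sketch. Everything in it is hypothetical: the feature names `cart_value` and `prior_orders`, the thresholds, and the tree itself are made up for illustration, and the sketch does not reproduce how pytyche fits or applies its policy tree. It only shows the shape of the idea: a visitor first lands in a cell according to the 5% / 5% / 90% shares, and the cell then determines how the treatment is chosen.

```python
import random
from collections import Counter

TREATMENTS = ["control", "low_promo", "free_ship"]
CELL_SHARES = {"Control": 0.05, "Explore": 0.05, "Optimized": 0.90}


def toy_policy_tree(visitor: dict) -> str:
    """Hypothetical depth-2 tree; features and thresholds are invented."""
    if visitor["cart_value"] >= 50.0:
        return "free_ship"
    if visitor["prior_orders"] == 0:
        return "low_promo"
    return "control"


def assign(visitor: dict, rng: random.Random) -> tuple:
    """Pick a cell by share, then a treatment according to that cell's rule."""
    cells = list(CELL_SHARES)
    cell = rng.choices(cells, weights=[CELL_SHARES[c] for c in cells])[0]
    if cell == "Control":
        return cell, "control"                 # baseline only
    if cell == "Explore":
        return cell, rng.choice(TREATMENTS)    # uniform-random treatment
    return cell, toy_policy_tree(visitor)      # routed through the tree


rng = random.Random(42)
visitor = {"cart_value": 62.0, "prior_orders": 3}
cell_counts = Counter(assign(visitor, rng)[0] for _ in range(20_000))
optimized_share = cell_counts["Optimized"] / 20_000
print(cell_counts, round(optimized_share, 3))
```

Because the Control and Explore cells ignore the tree, they keep producing data that does not depend on the current policy, which is what lets the next round's fit check the tree rather than merely confirm it.
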
```{code-cell} ipython3
r2.analysis
```

```{code-cell} ipython3
ax = pt.viz.plot_segment_intervals(r2.analysis.segments)
```

Expect narrower credible intervals than round 1, and the per-segment leaders to firm up. `arm_best_probabilities` (rendered per segment above) is the per-segment posterior probability that each arm is best. A leaf where the leader sits above `0.90` is one the engine is confident in, while a leaf whose leader sits at `0.45` against a `0.40` runner-up is one where the engine is still exploring.

The experiment exposes predicate accessors for the common questions:

```{code-cell} ipython3
{
    "credible segments yet": exp.has_credible_segments(),
    "graduation candidate yet": exp.has_graduation_candidate(),
}
```

`has_credible_segments()` answers "is any segment credible"; `has_graduation_candidate()` answers "is any (treatment, segment) pair ready to ship." With effects this clear, the first graduation candidate often appears as early as round 2; the floors keep the experiment honest either way. The actual candidate list comes from `exp.graduation_candidates(...)` (round 3, below).

## Round 3: mature segmentation

```{code-cell} ipython3
r3 = next(exp)
r3
```

After three rounds and 350,000 cumulative visitors, the policy tree's segmentation has matured. The default graduation rule (`ExpectedLossRule`, evaluated over each round's `recommendation_summary` decision evidence) fires for a (treatment, segment) pair when all three conditions hold across consecutive rounds: expected loss below tolerance, probability of outperforming control above 0.95, and probability of meaningful improvement above 0.80.

```{code-cell} ipython3
candidates = exp.graduation_candidates(sustained_rounds=2)
candidates[0]
```

A run at this scale typically graduates one to three (treatment, segment) pairs by round 3. The cell shows the first; `candidates` holds them all.

Each candidate is structured data. The operator (or an automated workflow) decides whether to promote one to broader rollout.
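
The three-condition graduation rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the `ExpectedLossRule` implementation: `RoundEvidence`, `passes`, and `graduates` are hypothetical names, the `loss_tolerance` value and the evidence numbers are invented, and "across consecutive rounds" is read here as "in each of the most recent `sustained_rounds` rounds".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoundEvidence:
    """Hypothetical per-round decision evidence for one (treatment, segment) pair."""
    expected_loss: float
    p_beats_control: float
    p_meaningful: float


def passes(evidence: RoundEvidence, loss_tolerance: float) -> bool:
    """All three conditions from the text must hold in a single round."""
    return (
        evidence.expected_loss < loss_tolerance
        and evidence.p_beats_control > 0.95
        and evidence.p_meaningful > 0.80
    )


def graduates(
    history: List[RoundEvidence], loss_tolerance: float, sustained_rounds: int = 2
) -> bool:
    """True when each of the most recent `sustained_rounds` rounds passes."""
    if len(history) < sustained_rounds:
        return False
    return all(passes(e, loss_tolerance) for e in history[-sustained_rounds:])


# Invented evidence for one pair over three rounds (illustrative numbers only).
history = [
    RoundEvidence(expected_loss=0.020, p_beats_control=0.90, p_meaningful=0.70),
    RoundEvidence(expected_loss=0.004, p_beats_control=0.97, p_meaningful=0.85),
    RoundEvidence(expected_loss=0.002, p_beats_control=0.99, p_meaningful=0.91),
]

after_round_2 = graduates(history[:2], loss_tolerance=0.005)
after_round_3 = graduates(history, loss_tolerance=0.005)
print(after_round_2, after_round_3)
```

Requiring the conditions to hold in more than one round guards against graduating a pair on a single lucky batch.
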
The library does not auto-ship.

The candidate's `latest_recommendation` carries the full decision evidence, including `expected_value_of_one_more_round`: the expected per-visitor reduction in regret from running one more round at the same per-round n. A candidate where this value is near zero is one the experiment has effectively converged on, so additional data is unlikely to change the decision. A value well above zero means you're still data-limited; ship now only if the time cost of another round exceeds the expected gain. See [decision-theoretic inputs](../concepts/decision-theoretic-inputs.md) for the formula and the compose-into-your-policy framing.

## What to do with the result

```{code-cell} ipython3
print(exp.summary)
```

```{code-cell} ipython3
exp.confidence
```

`exp.summary` is a multi-paragraph prose summary of the latest round in the context of the experiment's history. `exp.confidence` is a one-word label (`"high"`, `"medium"`, or `"low"`) derived from the credible-segment count and graduation candidate state, useful for at-a-glance status in a notebook or dashboard.

For platform handoff, `exp.next_recommendation` is the next-round cell structure your experimentation platform would consume. The structured Python dataclass is the canonical output. Platform integrations convert it to whatever shape the target configuration store expects.

## Sim-mode truth comparison

Because this tutorial runs against a generator, each round carries a truth comparison: the planted ground truth against the round's estimate. Real-data runs (where the generator returns `truth=None`) skip these.

```{code-cell} ipython3
r3.truth_comparison
```

Policy accuracy is the fraction of visitors for whom the recommended treatment matches the truth-optimal treatment. Oracle gap is the revenue-per-visitor (RPV) regret of the recommended policy versus the oracle policy.

## Round-by-round history

The full per-round snapshots are attached to the experiment.

```{code-cell} ipython3
for past in exp.history:
    print(past.summary_one_line())
```

Each entry carries the round's posterior, comparisons, recommendation summary, discovered segments, cells shipped, per-cell observations, and recommendation for the next round.

The same evolution visually, one panel per round:

![Round 1: cold-start Control/Explore split and the first fitted policy tree](../_static/first-adaptive-experiment-round-1.png)

![Round 2: the Optimized cell takes 90% of traffic; the tree refines](../_static/first-adaptive-experiment-round-2.png)

![Round 3: the mature segmentation the experiment ships](../_static/first-adaptive-experiment-round-3.png)

## What's next

This tutorial walked the simplest workflow: accept the recommended tree each round. From here:

- The [injecting your own treatment hypotheses](injecting-your-own-treatment-hypotheses.md) tutorial shows how to add your own cells alongside the recommended cell structure when you have a theory you want to test in parallel.
- [Advanced experimental design](../how-to/advanced-experimental-design.md) covers the statistical and budget knobs (graduation thresholds, control/explore floors, segmentation depth, schedule shape) and how to choose them for your domain.
- The [calibration at scale](../concepts/bcf-calibration-at-scale.md) concept doc explains what a calibration artifact corrects and why this tutorial's uncalibrated run (the `calibration=None` default) is the right starting point but not the production recommendation.
- [Statistical honesty](../concepts/statistical-honesty.md) explains why discovered-segment claims need the stability scores and credible intervals you saw above, and what goes wrong with post-hoc dashboard mining.
- The [glossary](../concepts/glossary.md) defines every term used here.