---
title: Running your first adaptive, sequential experiment
review-state: drafting
last-human-review: "2026-06-05"
depends-on:
  - src/pytyche/experiment
  - src/pytyche/analysis
  - src/pytyche/contracts.py
  - src/pytyche/generators/scenarios.py
owner: tradcliffe
quadrant: tutorial
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Running your first adaptive, sequential experiment

You're a growth PM evaluating three checkout-promotion treatments.
You don't know which works, or for whom — and a standard A/B test
won't tell you: it reports one average lift, hiding exactly the
per-customer structure you'd need to target. This tutorial runs an
adaptive sequential experiment instead: each round refits a
{term}`joint hurdle BCF` on the cumulative data, refines a decision
tree mapping visitor features to recommended treatments, and routes
the next round's traffic through that tree's segments. One call —
`pt.sequential_experiment(...)` — drives the whole loop.

The experiment is realistically sized: 350,000 visitors across
three rounds, end-to-end in about fifteen minutes on a single
consumer GPU. That is the point of the library — production-scale
Bayesian causal inference on hardware you already have.

This is the loop you're about to run — per round, the cell
allocation that served traffic and the policy tree it shipped:

![Round-by-round evolution of cell allocations and the policy tree, animated](../_static/first-adaptive-experiment-evolution.gif)

## Setup

```{code-cell} ipython3
import pytyche as pt

report = pt.check_setup()
```

The fits in this tutorial expect the CUDA device reported above. On
CPU they still run, but slowly, and they warn once. The report's
`calibration artifacts: (none bundled)` line is why this tutorial
[runs uncalibrated](#constructing-the-sequential-experiment).

## The scenario

Three candidate checkout-promotion treatments competing for a fixed
visitor budget:

- `control` — no promotion
- `low_promo` — small storefront promo
- `free_ship` — free shipping above $50

Three rounds covering 350,000 visitors total, on a doubling-batch
schedule (50,000 → 100,000 → 200,000). The data is generated by
pytyche's `clustered_realistic` template, a four-cluster e-commerce
mixture where the best treatment differs by customer cluster.

```{code-cell} ipython3
treatments = ["control", "low_promo", "free_ship"]

generator = pt.simulated_experiment_generator(
    template="clustered_realistic",
    metric="revenue_per_visitor",
    effect_scale=0.1,
    K=3,
    seed=42,
    treatment_names=treatments,
)

schedule = pt.GeometricSchedule(initial=50_000, growth=2.0, n_rounds=3)
```

`pt.simulated_experiment_generator(...)` wraps one of pytyche's
data-generating templates as a *generator* — the callable a
sequential experiment pulls each round's data from. `K` is the
treatment count (control + K−1 active treatments), and
`treatment_names=` maps the template's arms onto this experiment's
naming. The modest `effect_scale` plants realistic per-segment
effects: large enough to find, small enough that finding them takes
data. A real deployment passes its own generator — any callable
that pulls a round's data from your experimentation platform — and
everything below stays the same.
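
To make the schedule arithmetic concrete, here is a minimal standalone sketch of the doubling-batch schedule described above. It does not call pytyche: `geometric_batch_sizes` is a hypothetical helper written for illustration, and it assumes that `growth` multiplies each round's batch size by a constant factor.

```python
def geometric_batch_sizes(initial: int, growth: float, n_rounds: int) -> list[int]:
    """Per-round batch sizes for a geometric schedule (hypothetical helper)."""
    # Assumption: round k (0-indexed) serves initial * growth**k visitors.
    return [round(initial * growth**k) for k in range(n_rounds)]


sizes = geometric_batch_sizes(initial=50_000, growth=2.0, n_rounds=3)
# Running total of visitors seen by the end of each round.
cumulative = [sum(sizes[: k + 1]) for k in range(len(sizes))]

print(sizes)       # [50000, 100000, 200000]
print(cumulative)  # [50000, 150000, 350000]
```

The cumulative totals are the sample sizes each refit sees, which is why later rounds produce tighter posteriors than round 1.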

## Constructing the sequential experiment

`pt.sequential_experiment(...)` returns a stateful object you
iterate round by round. Configuration is fixed at construction.
Per-round overrides (custom cells, intervention) happen via the
iteration loop.

```{code-cell} ipython3
exp = pt.sequential_experiment(
    generator=generator,
    schedule=schedule,
    treatments=treatments,
    min_control_weight=0.05,
    min_explore_weight=0.05,
    max_segment_depth=3,
    seed=42,
)
```

Parameters worth understanding before you ship:

- `generator` is any callable that produces a round's observed data
  (and, in sim mode, the corresponding ground truth) when the
  experiment advances. Real-data runs pass a callable that pulls the
  round's data from the experimentation platform.
- `min_control_weight=0.05` and `min_explore_weight=0.05` are the
  traffic floors for the Control and Explore cells. The Control cell
  never falls below 5% of round traffic, and the Explore cell
  (uniform-random across treatments) never falls below 5% either.
  Together they guarantee that every round keeps measuring the
  baseline and keeps observing every treatment, regardless of how
  confident the model becomes.
- `calibration` defaults to `None`: posteriors are used uncorrected,
  and every round's results carry an explicit uncalibrated label.
  Uncalibrated posteriors are typically overconfident at scale, which
  also makes allocation concentrate on leaders faster than honest
  uncertainty would. Passing a calibration artifact instead applies
  an SBC-fitted coverage correction so posterior intervals stay
  honest — see
  [calibration at scale](../concepts/bcf-calibration-at-scale.md)
  for what that corrects and when it matters.
- `progress=True` (not set here) renders live progress bars over
  each round's fit — useful in an interactive notebook while a
  multi-minute fit runs.
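
The effect of the two floors can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not pytyche's allocation code: `allocate_cells` and its `want_*` arguments are hypothetical names, and the sketch assumes that any traffic not held by Control or Explore goes to a single Optimized cell.

```python
def allocate_cells(
    n_visitors: int,
    want_control: float,
    want_explore: float,
    min_control: float = 0.05,
    min_explore: float = 0.05,
) -> dict[str, int]:
    """Split one round's traffic into cells, honoring both floors (hypothetical helper)."""
    # Clamp the requested weights up to their floors.
    control_w = max(want_control, min_control)
    explore_w = max(want_explore, min_explore)
    if control_w + explore_w > 1.0:
        raise ValueError("Control and Explore weights exceed the round's traffic")
    control = round(n_visitors * control_w)
    explore = round(n_visitors * explore_w)
    # Assumption: whatever is left goes to the Optimized cell.
    return {"control": control, "explore": explore, "optimized": n_visitors - control - explore}


# A fully confident model asks for no Control or Explore traffic at all;
# the floors still reserve 5% each.
round_2 = allocate_cells(100_000, want_control=0.0, want_explore=0.0)
print(round_2)  # {'control': 5000, 'explore': 5000, 'optimized': 90000}
```

However small the requested Control and Explore weights are, the Optimized cell can never take more than 90% of a round with these defaults.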

## Round 1: cold start

The default round-1 cell structure is a Control cell and an Explore
cell at 50/50 share. Half the visitors get the baseline. The other
half get a uniform-random treatment, giving the model clean HTE
signal to learn from.
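
As a rough mental model of the cold-start split, the sketch below assigns simulated visitors to the two cells. It is standalone and does not reflect how pytyche assigns traffic internally: `assign_cold_start` and the cell labels are hypothetical, and it assumes the Explore cell draws uniformly over all three listed treatments, control included.

```python
import random
from collections import Counter

TREATMENTS = ["control", "low_promo", "free_ship"]


def assign_cold_start(n_visitors: int, treatments: list[str], seed: int = 0) -> list[tuple[str, str]]:
    """Return (cell, treatment) pairs for a 50/50 Control/Explore round (hypothetical helper)."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(n_visitors):
        if rng.random() < 0.5:
            # Control cell: every visitor gets the baseline.
            assignments.append(("control_cell", treatments[0]))
        else:
            # Explore cell: uniform-random treatment (assumption: over all arms).
            assignments.append(("explore_cell", rng.choice(treatments)))
    return assignments


assignments = assign_cold_start(10_000, TREATMENTS)
cell_counts = Counter(cell for cell, _ in assignments)
explore_arms = Counter(arm for cell, arm in assignments if cell == "explore_cell")
print(cell_counts)
print(explore_arms)
```

Because treatment inside the Explore cell is randomized independently of visitor features, differences in outcomes between arms there can be read as causal.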

```{code-cell} ipython3
r1 = next(exp)

r1
```

After round 1 the posterior is still finding its footing: there is
preliminary signal (the `clustered_realistic` DGP plants real
heterogeneity) but the per-segment picture is not yet settled.

You can inspect the round's discovered segments directly. Each
segment's `gate_estimate` is the posterior mean of its segment-level
average treatment effect (the GATE), and `gate_ci` is the 80%
credible interval. The 80% level is the `contracts.py` convention: it
is narrower than a 95% interval, which biases toward action.
`stability_score` is a bootstrap-replicability score on the segment
boundary; segments with `stability_score >= 0.80` are considered
credible enough to act on.

```{code-cell} ipython3
r1.analysis
```
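
To see what a posterior mean and an 80% credible interval look like numerically, the sketch below computes both from a set of posterior draws. The draws are simulated stand-ins, not pytyche output, and the interval is assumed to be equal-tailed (10th to 90th percentile); the library may compute `gate_ci` differently.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical posterior draws of one segment's average treatment effect.
gate_draws = [rng.gauss(0.30, 0.10) for _ in range(10_000)]

# Posterior mean of the segment-level effect.
gate_estimate = statistics.fmean(gate_draws)

# quantiles(n=10) returns the 10th, 20th, ..., 90th percentile cut points.
deciles = statistics.quantiles(gate_draws, n=10)
gate_ci = (deciles[0], deciles[-1])  # equal-tailed 80% interval

# quantiles(n=40) cuts every 2.5%, so its ends bound a 95% interval.
fortieths = statistics.quantiles(gate_draws, n=40)
ci_95 = (fortieths[0], fortieths[-1])

# Fraction of draws that fall inside the 80% interval.
coverage = sum(gate_ci[0] <= d <= gate_ci[1] for d in gate_draws) / len(gate_draws)

print(f"estimate={gate_estimate:.3f}")
print(f"80% CI=({gate_ci[0]:.3f}, {gate_ci[1]:.3f})  coverage={coverage:.2f}")
print(f"95% CI=({ci_95[0]:.3f}, {ci_95[1]:.3f})")
```

The 80% interval is visibly narrower than the 95% one on the same draws, so an effect clears zero sooner under the 80% convention.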

And the recommendation for the next round:

```{code-cell} ipython3
plan = r1.next_recommendation

plan
```

`next_recommendation` carries the recommended cell structure for
the next round. The operator may accept it as-is, partially override
it (for example, by adding a hypothesis cell alongside the
recommended Optimized cell), or replace it entirely. This tutorial
accepts it as-is throughout.

```{code-cell} ipython3
ax = pt.viz.plot_cells(plan.cells)
```

The Optimized cell takes 90% of round-2 traffic; the Control and
Explore floors keep their guaranteed 5% shares no matter how
confident the model becomes.

## Round 2: narrowing toward responders

```{code-cell} ipython3
r2 = next(exp)

r2
```

The cumulative posterior now incorporates 150,000 visitors. Round
1's recommendation introduced an Optimized cell that routes visitors
per the policy tree fitted on round 1's CATEs, and the dashboard's
scoreboard shows whether that targeted routing produced lift over
Control — the headline number for the round. The HTE discovery
itself is independent of cells: it is a property of the joint
posterior over all the data, regardless of how that data was routed.

```{code-cell} ipython3
r2.analysis
```

```{code-cell} ipython3
ax = pt.viz.plot_segment_intervals(r2.analysis.segments)
```

Expect narrower credible intervals than in round 1, and expect the
per-segment leaders to firm up. `arm_best_probabilities` (rendered
per segment above) is the per-segment posterior probability that
each arm is best. A leaf whose leader sits above `0.90` is one the
engine is confident in, while a leaf whose leader sits at `0.45`
against a `0.40` runner-up is one it is still exploring. The
experiment exposes predicate accessors for the common questions:

```{code-cell} ipython3
{
    "credible segments yet": exp.has_credible_segments(),
    "graduation candidate yet": exp.has_graduation_candidate(),
}
```

`has_credible_segments()` answers "is any segment credible";
`has_graduation_candidate()` answers "is any (treatment, segment)
pair ready to ship." With effects this clear, the first graduation
candidate often appears as early as round 2 — the floors keep the
experiment honest either way. The actual candidate list comes from
`exp.graduation_candidates(...)` (round 3, below).
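
The per-segment "probability that each arm is best" discussed earlier in this round can be approximated from posterior draws by counting how often each arm wins. The sketch below is a standalone illustration with simulated draws. The function here is a hypothetical helper that only borrows the field's name, and it assumes the draws are aligned by index, so that position `i` in every arm's list comes from the same joint posterior draw.

```python
import random


def arm_best_probabilities(draws_by_arm: dict[str, list[float]]) -> dict[str, float]:
    """Share of posterior draws in which each arm has the highest value (hypothetical helper)."""
    arms = list(draws_by_arm)
    n_draws = len(draws_by_arm[arms[0]])
    wins = {arm: 0 for arm in arms}
    for i in range(n_draws):
        # The arm with the largest value in draw i wins that draw.
        best = max(arms, key=lambda arm: draws_by_arm[arm][i])
        wins[best] += 1
    return {arm: wins[arm] / n_draws for arm in arms}


rng = random.Random(1)
# Hypothetical posterior draws of expected revenue per visitor in one leaf.
leaf_draws = {
    "control": [rng.gauss(1.00, 0.05) for _ in range(5_000)],
    "low_promo": [rng.gauss(1.05, 0.05) for _ in range(5_000)],
    "free_ship": [rng.gauss(1.20, 0.05) for _ in range(5_000)],
}

probs = arm_best_probabilities(leaf_draws)
leader = max(probs, key=probs.get)
print(leader, probs)
```

With means this far apart relative to the posterior spread, the leader's probability lands well above `0.90`; pull the means closer together and the probabilities drift toward an even split.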

## Round 3: mature segmentation

```{code-cell} ipython3
r3 = next(exp)

r3
```

After three rounds and 350,000 cumulative visitors, the policy
tree's segmentation has matured. The default graduation rule
(`ExpectedLossRule`, evaluated over each round's
`recommendation_summary` decision evidence) fires for a (treatment,
segment) pair when all three conditions hold across consecutive
rounds: expected loss below tolerance, probability of outperforming
control above 0.95, and probability of meaningful improvement above
0.80.

```{code-cell} ipython3
candidates = exp.graduation_candidates(sustained_rounds=2)

candidates[0]
```

A run at this scale typically graduates one to three (treatment,
segment) pairs by round 3 — the cell shows the first; `candidates`
holds them all. Each candidate is structured data. The operator (or
an automated workflow) decides whether to promote one to broader
rollout. The library does not auto-ship.
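
The three-condition, sustained-rounds rule described above can be sketched as a small predicate. This is an illustration rather than the `ExpectedLossRule` implementation. `RoundEvidence`, `passes`, and `is_graduation_candidate` are hypothetical names, the `0.95` and `0.80` thresholds come from the text above, and the `loss_tolerance` value is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundEvidence:
    """Decision evidence for one (treatment, segment) pair in one round (hypothetical)."""
    expected_loss: float
    p_beats_control: float
    p_meaningful: float


def passes(evidence: RoundEvidence, loss_tolerance: float) -> bool:
    """All three conditions must hold in the same round."""
    return (
        evidence.expected_loss < loss_tolerance
        and evidence.p_beats_control > 0.95
        and evidence.p_meaningful > 0.80
    )


def is_graduation_candidate(
    history: list[RoundEvidence], loss_tolerance: float, sustained_rounds: int = 2
) -> bool:
    """True when the most recent `sustained_rounds` rounds all pass."""
    if len(history) < sustained_rounds:
        return False
    return all(passes(e, loss_tolerance) for e in history[-sustained_rounds:])


history = [
    RoundEvidence(expected_loss=0.020, p_beats_control=0.90, p_meaningful=0.70),  # round 1: fails
    RoundEvidence(expected_loss=0.004, p_beats_control=0.97, p_meaningful=0.85),  # round 2: passes
    RoundEvidence(expected_loss=0.002, p_beats_control=0.99, p_meaningful=0.91),  # round 3: passes
]

print(is_graduation_candidate(history, loss_tolerance=0.01, sustained_rounds=2))  # True
print(is_graduation_candidate(history, loss_tolerance=0.01, sustained_rounds=3))  # False
```

Requiring the conditions to hold over consecutive rounds guards against a single lucky round promoting a pair that the next round's data would not support.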

The candidate's `latest_recommendation` carries the full decision
evidence, including `expected_value_of_one_more_round`: the
expected per-visitor reduction in regret from running one more
round at the same per-round n. A candidate where this value is near
zero is one the experiment has effectively converged on:
additional data is unlikely to change the decision. A value well
above zero means you're still data-limited; ship only if the
time-cost of another round exceeds the expected gain. See
[decision-theoretic inputs](../concepts/decision-theoretic-inputs.md)
for the formula and the compose-into-your-policy framing.

## What to do with the result

```{code-cell} ipython3
print(exp.summary)
```

```{code-cell} ipython3
exp.confidence
```

`exp.summary` is a multi-paragraph prose summary of the latest
round in the context of the experiment's history.
`exp.confidence` is a one-word label (`"high"`, `"medium"`, or
`"low"`) derived from the credible-segment count and graduation
candidate state, useful for at-a-glance status in a notebook or
dashboard.

For platform handoff, `exp.next_recommendation` is the next-round
cell structure your experimentation platform would consume. The
structured Python dataclass is the canonical output. Platform
integrations convert it to whatever shape the target configuration
store expects.

## Sim-mode truth comparison

Because this tutorial runs against a simulated generator, each round
carries a truth comparison: the planted ground truth set against the
round's estimate. Real-data runs (where the generator returns
`truth=None`) skip it.

```{code-cell} ipython3
r3.truth_comparison
```

Policy accuracy is the fraction of visitors for whom the
recommended treatment matches the truth-optimal treatment. Oracle
gap is the revenue-per-visitor (RPV) regret of the recommended
policy relative to the oracle policy.
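
Both metrics are easy to compute by hand when the true per-visitor values are known. The sketch below is a minimal standalone version on four made-up visitors. `truth_comparison` here is a hypothetical function, not the attribute shown above, and it assumes the gap is averaged per visitor; the library may aggregate differently.

```python
def truth_comparison(
    true_rpv: list[dict[str, float]], recommended: list[str]
) -> tuple[float, float]:
    """Return (policy accuracy, mean oracle gap) for a recommended policy (hypothetical helper)."""
    matches = 0
    total_gap = 0.0
    for visitor_rpv, arm in zip(true_rpv, recommended):
        # The oracle picks the arm with the highest true expected revenue.
        best_arm = max(visitor_rpv, key=visitor_rpv.get)
        matches += arm == best_arm
        # Regret: revenue given up by following the recommendation.
        total_gap += visitor_rpv[best_arm] - visitor_rpv[arm]
    n = len(true_rpv)
    return matches / n, total_gap / n


# Made-up true expected revenue per visitor under each treatment.
true_rpv = [
    {"control": 1.0, "low_promo": 1.1, "free_ship": 1.3},
    {"control": 1.0, "low_promo": 1.2, "free_ship": 1.1},
    {"control": 1.0, "low_promo": 0.9, "free_ship": 0.8},
    {"control": 1.0, "low_promo": 1.1, "free_ship": 1.5},
]
recommended = ["free_ship", "low_promo", "low_promo", "free_ship"]

accuracy, gap = truth_comparison(true_rpv, recommended)
print(accuracy)        # 0.75
print(f"{gap:.3f}")    # 0.025
```

The third visitor is the only mismatch: the policy recommends `low_promo` where control is truly best, which costs 0.1 for that visitor and 0.025 averaged over all four.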

## Round-by-round history

The full per-round snapshots are attached to the experiment.

```{code-cell} ipython3
for past in exp.history:
    print(past.summary_one_line())
```

Each entry carries the round's posterior, comparisons, summary
recommendation, discovered segments, cells shipped, per-cell
observations, and recommendation for the next round.

The same evolution visually, one panel per round:

![Round 1: cold-start Control/Explore split and the first fitted policy tree](../_static/first-adaptive-experiment-round-1.png)

![Round 2: the Optimized cell takes 90% of traffic; the tree refines](../_static/first-adaptive-experiment-round-2.png)

![Round 3: the mature segmentation the experiment ships](../_static/first-adaptive-experiment-round-3.png)

## What's next

This tutorial walked the simplest workflow: accept the recommended
tree each round. From here:

- The [injecting your own treatment
  hypotheses](injecting-your-own-treatment-hypotheses.md) tutorial
  shows how to add your own cells alongside the recommended cell
  structure when you have a theory you want to test in parallel.
- [Advanced experimental
  design](../how-to/advanced-experimental-design.md) covers the
  statistical and budget knobs — graduation thresholds,
  control/explore floors, segmentation depth, schedule shape — and
  how to choose them for your domain.
- The [calibration at scale](../concepts/bcf-calibration-at-scale.md)
  concept doc explains what a calibration artifact corrects and why
  this tutorial's uncalibrated run (the `calibration=None` default)
  is the right starting point but not the production recommendation.
- [Statistical honesty](../concepts/statistical-honesty.md) explains
  why discovered-segment claims need the stability scores and
  credible intervals you saw above — and what goes wrong with
  post-hoc dashboard mining.
- The [glossary](../concepts/glossary.md) defines every term used
  here.