---
title: "Sequential Targeting via Segment-Based Enrichment"
review-state: drafting
last-human-review: "2026-06-09"
depends-on:
  - src/pytyche/experiment
  - src/pytyche/analysis
owner: unowned
quadrant: concept
---

# Sequential Targeting via Segment-Based Enrichment

Design rationale for pytyche's sequential surface: why the
HTE estimation → segment discovery → targeted allocation → re-estimation
flywheel compounds value over batched e-commerce experiments. The
operator surface for this loop is `pt.sequential_experiment(...)` — the
[first adaptive experiment](../tutorials/first-adaptive-experiment.md)
tutorial walks it end-to-end; this doc explains why it's shaped the way
it is.

## Problem Statement

We have a validated HTE estimation pipeline: joint hurdle BCF recovers
per-visitor CATEs decomposed into conversion and AOV channels. The
8-scenario benchmark (March 2026, n=50k) shows Joint BCF winning GATE RMSE in
7 of 8 scenarios, meaning it correctly identifies *which groups* benefit from
treatment and which are harmed by it.

The missing proof: **does acting on those estimates actually help?** If we
target treatment to estimated benefiters and withhold from estimated
non-benefiters, does realized lift exceed uniform allocation? Does the
improvement compound over sequential experiments as estimates sharpen?

To our knowledge, nobody has demonstrated this loop empirically for **batched**
e-commerce experiments on **zero-inflated revenue**. The closest work operates in
different settings (online bandits, single-shot policy evaluation,
continuous outcomes).

## Literature Positioning

### What exists

**Contextual bandits with HTE oracles (Carranza, Krishnamurthy, Athey):**
Connects contextual-bandit regret to HTE estimation quality — showing
CATE oracles are more sample-efficient than full reward modeling. But:
online/single-visitor updates, not batched experiments. Continuous
outcomes, not zero-inflated.

**Batched bandit theory (Perchet et al. 2016, arXiv 1505.00369; Esfandiari
et al. AAAI 2021; Che & Namkoong 2023, arXiv 2303.11582):**
Establishes that **O(log log T) batches suffice** to match the regret of
a fully sequential bandit, with geometric/doubling batch sizes. Che &
Namkoong 2023 specifically targets adaptive experimentation at scale
with flexible batches + delayed feedback. The theoretical anchor for
pytyche's geometric visitor scheduling.

**R-learner / causal forests (Nie & Wager 2021, Wager & Athey 2018):**
Foundation for CATE estimation with theoretical guarantees (asymptotic
normality, honesty). The R-learner's residual-on-residual approach based
on Frisch-Waugh-Lovell is elegant. But: single-experiment estimators, no
sequential targeting loop. Continuous outcomes only.

**Coarse personalization via segment compression (Zhang & Misra 2022,
"Coarse Personalization," arXiv 2204.05793, EC '24):**
A food-delivery promotions field experiment found that **5 discrete
segments recover 99.5% of full personalization value** — the policy
tree is a near-lossless compression of the CATE surface. But: single
experiment, off-the-shelf methods, continuous outcomes, no sequential
loop, no hurdle decomposition. The result is domain-specific: Shchetkina
2024 (arXiv 2411.16552, withdrawn) reports personalization gains of 18% vs
4% depending on population conditions, which suggests the 99.5% figure
transfers cleanly to promo-like settings but isn't universal.

**Single-shot policy evaluation methodology (Hitsch, Misra & Zhang 2024,
QME 22:115-168):** Canonical marketing-science framing for off-policy
evaluation of arbitrary targeting policies from a single RCT. Related to
the above but addresses *how* to evaluate policy trees, not the
empirical 99.5% finding.

**Adaptive enrichment (clinical-trial designs):**
Adaptive-enrichment trial designs progressively narrow trial
populations to responsive subgroups. Closest to our loop conceptually:
estimate → select → re-estimate. But: very different statistical setting
(survival/binary outcomes, small samples, phase II/III drugs).

### The gap

| Setting | Prior work | Our contribution |
|---|---|---|
| Single-shot HTE + targeting | Zhang & Misra 2022; Hitsch/Misra/Zhang 2024 | — |
| Online sequential targeting | Carranza/Krishnamurthy/Athey | — |
| Batched sequential bandit theory | Perchet 2016 / Esfandiari 2021 / Che & Namkoong 2023 | — |
| Batched sequential HTE targeting | *Gap* | This work |
| Zero-inflated revenue outcomes | *Gap* | This work |
| Segment-based enrichment for e-commerce | *Gap* | This work |

The contribution is compositional: batched sequential targeting + hurdle
BCF + segment-based enrichment + e-commerce application. No individual
component is novel, but the combination and empirical demonstration on
zero-inflated revenue hasn't been done.

## Approach: Segment-Based Enrichment

### Why not per-visitor propensities?

The straightforward approach is: estimate τ̂(x) per visitor → set
treatment probability proportional to estimated benefit. This is
theoretically optimal (maximizes expected welfare) but practically
problematic:

1. **Instability**: individual CATE estimates are noisy; small changes in
   estimation produce large swings in per-visitor propensities across
   rounds, making the system erratic.

2. **Operational opacity**: a 50k-row propensity table is impossible for a
   merchandising team to act on. You can't brief stakeholders on "visitor
   #34,721 gets p=0.73."

3. **Statistical complexity**: non-uniform propensities require inverse
   probability weighting for valid estimation, introducing variance
   inflation and coverage issues.

4. **Scope limitation**: per-visitor propensities only work for real-time
   storefront personalization. They can't drive cross-vertical business
   decisions ("should we also change email targeting for this segment?").

### Segment-based enrichment model

Instead, we fit a **shallow decision tree** (depth 2-3) on the CATE
estimates to produce **human-interpretable segments**. Each segment gets a
binary decision: treat or control.

```
BCF ensemble (50 tau trees × 100+ MCMC draws)
     │
     │  posterior mean τ̂(x) per visitor
     ▼
Policy tree (single DecisionTreeRegressor, depth 2-3)
     │
     │  4-8 segments with feature-based rules
     ▼
Segment allocation:
  "benefit" segments → 50/50 uniform (active experimentation)
  "harm" segments    → 100% control (with ε exploration floor)
```

**Within active segments, assignment is uniform 50/50.** This preserves
maximum statistical power per observation and avoids propensity correction.
The "targeting" is in segment *selection*, not assignment *biasing*.

This is essentially **adaptive enrichment** — the clinical-trial design
pattern adapted to e-commerce.
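
The allocation step in the diagram can be sketched in a few lines. This is an illustration, not pytyche's implementation: the `Segment` container, the `treatment_propensity` helper, and the sign-of-mean rule for calling a segment "benefit" or "harm" are assumptions. The 50/50 and ε/2 propensities follow the text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container: one policy-tree leaf and its mean CATE estimate."""
    rule: str
    mean_cate: float


def treatment_propensity(segment: Segment, epsilon: float = 0.10) -> float:
    """Probability that a visitor in this segment is assigned to treatment.

    Assumed rule: a segment with positive mean CATE stays in active 50/50
    experimentation; any other segment is held at control apart from an
    epsilon / 2 exploration floor.
    """
    return 0.5 if segment.mean_cate > 0 else epsilon / 2


segments = [
    Segment("mobile & prior_purchases > 3", mean_cate=0.42),
    Segment("desktop & new visitor", mean_cate=-0.15),
]
propensities = {s.rule: treatment_propensity(s) for s in segments}
```

Only two propensity values ever occur, which is what keeps the downstream estimation simple.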

### The information bottleneck

Individual BCF tau trees are mostly **stumps** (α_tau=0.25, β_tau=3.0
means only 25% chance of splitting at root, 3% at depth 1). But the
ensemble of 50 trees × 100+ MCMC draws captures a complex CATE surface
through additive composition.
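
The two percentages follow from the standard BART depth prior, under which a node at depth d splits with probability α(1 + d)^(−β). A quick check, assuming pytyche's tau forest uses that form (`split_probability` is an illustrative name):

```python
def split_probability(depth: int, alpha: float = 0.25, beta: float = 3.0) -> float:
    """Prior probability that a node at `depth` splits: alpha * (1 + depth) ** -beta."""
    return alpha * (1 + depth) ** (-beta)


root = split_probability(0)    # 0.25
depth1 = split_probability(1)  # 0.25 / 8 = 0.03125, about 3%
# Prior probability that a tree is a stump (the root never splits).
p_stump = 1 - root             # 0.75
```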

The policy tree is a deliberate compression layer: "given everything BCF
learned, what are the 4-8 most important groups?" This is the
coarse-personalization insight from Zhang & Misra 2022 quantified —
5 segments recover 99.5% of full personalization value in their
food-delivery promo experiment. Giving up the remaining 0.5% buys enormous
operational simplicity. Domain-specific result; transfer to other
settings (especially zero-inflated) hasn't been directly studied.

### Why this matters for practice

The segment framing is how real experimentation teams operate:

1. **Stability over time**: coarse groups (e.g., "mobile users with >3
   prior purchases") are robust to population drift. Per-visitor
   propensities shift with every new visitor.

2. **Cross-vertical actionability**: segments can drive decisions beyond
   the storefront — email campaigns, pricing changes, inventory allocation.
   "High-engagement desktop users benefit from our premium layout" is
   actionable across channels.

3. **Progressive confidence**: early rounds use heavy control allocation
   (discovery). As segment effects stabilize, control fraction decreases
   (monitoring). The transition from discovery to exploitation is a
   configurable schedule, not a binary switch.

## Core Loop

The per-round estimator is the **multi-arm joint hurdle BCF**
({doc}`multi-arm-hurdle-bcf`): a single shared prognostic forest plus
a tau forest whose leaves carry a `(K−1)` contrast vector, yielding a
joint posterior over all treatment-vs-control contrasts per visitor.
The sequential engine calls the same public analysis primitives that
power users call directly — there is no private variant. For the type
hierarchy see {doc}`result-objects`; for a hands-on walkthrough of each
primitive see
[Working with the posterior](../tutorials/working-with-the-posterior.md).

```
Round 1: uniform allocation across all arms
         → pt.fit(observed, pooling="joint")
             returns HurdleBCFResult (per-visitor rpv_cate_samples)
         → posterior.fit_policy_tree(...)
             fits policy tree on cumulative per-visitor CATEs
             → initial segment-to-arm mapping

Round r>1: classify new visitors into segments from round r-1
           → each segment → assigned to its best arm (or control)
           → "uncertain" segments: exploration allocation
           → observe outcomes
           → pt.fit(accumulated_observed, pooling="joint")
               fits on ALL accumulated data → updated HurdleBCFResult
           → posterior.fit_policy_tree(...)
               refit policy tree → updated segment-to-arm mapping

Final:   converged segments with stable per-arm treatment rules
```

Per-round primitives in order:

1. **Estimator fit** — `pt.fit(observed, pooling="joint")` (equivalently
   `fit_hurdle_bcf(observed, pooling="joint")`) takes an
   `ObservedExperimentData` and returns a `HurdleBCFResult`. The engine
   accumulates all data across rounds before each fit; the result carries
   `rpv_cate_samples: (n, S_total, K−1)` — one contrast column per
   treatment vs control.

2. **Segmentation** — `posterior.fit_policy_tree(max_depth=...,
   min_segment_share=...)` fits a policy tree on the per-visitor
   posterior-mean CATE vectors, returning a `PolicyTreeResult` with
   `segments: list[DiscoveredSegment]`. Each segment carries
   `stability_score` (bootstrap replicability), `gate_estimate`, and
   `arm_best_probabilities` keyed by all variant names including control.

3. **Allocation** — `posterior.thompson_allocation(segments, epsilon=...)`.
   Per-leaf allocation is computed from the fraction of posterior draws in
   which each arm is best under the **shared best-arm rule**:
   `best_arm(δ) = 0` (control) if `max_j δ_j ≤ 0`, else `argmax_j δ_j + 1`.
   Control is a first-class winner — it takes a draw exactly when every
   contrast is non-positive. The `epsilon` kwarg is the internal
   Thompson safety-net floor (`ε/K` per active treatment). It is NOT
   the operator-facing controls-retention dial; controls retention at the
   L1 surface is `min_control_weight` / `min_explore_weight` on
   `pt.sequential_experiment(...)`.

4. **Calibration application** — `posterior.apply_calibration(calibration)`
   returns a new posterior of the same type with the R(p) + scale-family
   correction applied. K=2 only in v0.2; K≥3 per-contrast calibration
   raises `NotImplementedError` until the per-contrast SBC machinery
   ships with the sequential-surface calibration work.

5. **Ship / stop / continue** — `posterior.recommendation_summary(
   treatment, segment)` returns a `RecommendationSummary` with
   `decision`, `expected_loss_comparison`, `probability_positive`,
   `probability_better`, `probability_harmful`, and
   `expected_value_of_one_more_round`. `expected_value_of_one_more_round`
   is the information-theoretic value of running one additional round at
   the same per-round n. Near-zero means the experiment has converged and
   additional data is unlikely to change the decision.

### Key design decisions

**Data accumulation**: BCF is refit on all data across rounds, not just the
latest round. This gives maximum sample size for estimation but requires
tracking per-visitor propensities (which vary by round and segment
membership). Within active segments, propensity = 0.5. Within dropped
segments, propensity = ε/2.

**Assignment through the generator's own hook**: `generate_v2_core`
decomposes into `sample_features → compute_potential_outcomes →
assign_and_observe → build_bundle`. The sim-mode adapter computes
policy-routed assignments from the round plan's cells and drives
`assign_and_observe` through its external `treatment_assignment` hook —
one generation path, with the realized per-visitor assignment
propensities recorded alongside the data.

**Epsilon exploration floor**: "dropped" segments still get a small ε
fraction treated. This allows detecting if the treatment effect changed
(e.g., the population shifted, or a product change made the treatment
beneficial for a previously-harmed group). ε=0.10 as default, sweepable.
This is the ε passed to `posterior.thompson_allocation(epsilon=...)` —
the within-Thompson safety net. The cell-level controls-retention floor
(the L1 operator dial) is separate.
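
The best-arm rule and the epsilon floor can be sketched on synthetic segment-level draws. This is an illustration, not `posterior.thompson_allocation` itself: `thompson_weights` is a hypothetical helper, and the way it mixes in the floor, (1 − ε) × frequency + ε/K for every arm including control, is an assumed form that guarantees each arm at least ε/K.

```python
import random


def best_arm(contrasts: list[float]) -> int:
    """Shared best-arm rule: 0 (control) if no contrast is positive,
    otherwise 1 + index of the largest contrast."""
    top = max(contrasts)
    if top <= 0:
        return 0
    return contrasts.index(top) + 1


def thompson_weights(draws: list[list[float]], epsilon: float = 0.10) -> list[float]:
    """Fraction of posterior draws in which each arm is best, mixed with a
    uniform floor. The mixing form is an assumption made for this sketch."""
    k = len(draws[0]) + 1  # K arms = control + (K - 1) treatments
    counts = [0] * k
    for contrasts in draws:
        counts[best_arm(contrasts)] += 1
    freqs = [c / len(draws) for c in counts]
    return [(1 - epsilon) * f + epsilon / k for f in freqs]


rng = random.Random(0)
# Synthetic draws for K = 3: treatment 1 clearly helps, treatment 2 hurts.
draws = [[rng.gauss(0.5, 0.2), rng.gauss(-0.3, 0.2)] for _ in range(2000)]
weights = thompson_weights(draws)  # [control, treatment 1, treatment 2]
```

Control wins a draw only when both contrasts are non-positive, so here nearly all of the weight lands on treatment 1 while the floor keeps the other two arms alive.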

### Segment discovery vs. bandit optimization

This system is **not a contextual bandit**. The goal is not "optimal
per-visitor allocation to maximize cumulative reward." The goal is:
discover stable, interpretable population segments with consistent
treatment responses, then evolve the policy tree that routes them.

| | Contextual bandit | Segment discovery |
|---|---|---|
| Unit | individual visitor | segment (country × device × engagement) |
| Output | per-visitor propensity (opaque) | policy tree rules (interpretable) |
| Converges to | optimal allocation function | stable segment definitions |
| Actionable by | real-time personalization engine | merchandising team, cross-channel |

The policy tree's job is to produce rules like "high-engagement mobile
users in DE respond to premium layout" — actionable across storefront,
email, pricing. Per-visitor propensities can't drive these decisions.

This distinction determines the allocation basis (see below).

### Allocation basis: segment-mean vs. individual

The allocation probability P(τ>0) can be computed two ways:

- **Segment-mean**: P(segment-mean CATE > 0) across posterior draws.
  "Is treating this segment *as a group* net-positive?"
- **Individual**: fraction of individual posterior samples > 0.
  "What share of *visitors* in this segment have positive CATE?"

For segment discovery, segment-mean is correct. Take a segment where 75% of
visitors benefit (individual P(τ>0) ≈ 0.75) and the segment-mean CATE is
clearly positive (segment-mean P(τ>0) ≈ 0.95). It should be treated
decisively: the within-segment heterogeneity is resolved by *refining
the tree* (splitting further), not by setting allocation to 0.75.

The individual-fraction approach produces the "knows but won't commit"
problem: segments with large true effects and narrow CIs still get mushy
p_treat values (0.6-0.8) because the segment is internally mixed.
Segment-mean pushes these to near 0 or 1, matching the binary
treat/hold decision the system is designed to make.

The individual-fraction approach was retained as a research baseline
during validation; segment-mean is what ships.
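
The difference between the two bases shows up on a toy segment. The draws below are synthetic, not BCF output: each posterior draw is modeled as the visitor's true effect plus a shift shared across the segment plus independent per-visitor noise, all of which are assumptions made for illustration.

```python
import random

rng = random.Random(0)
n_visitors, n_draws = 200, 200

# Synthetic segment: 75% of visitors benefit (tau = +1.0), 25% are harmed (tau = -0.5).
true_tau = [1.0] * 150 + [-0.5] * 50

# draws[s][i]: posterior draw s of visitor i's CATE. Each draw shares a common
# shift (uncertainty about the overall level) plus per-visitor noise.
draws = []
for _ in range(n_draws):
    shift = rng.gauss(0.0, 0.3)
    draws.append([tau + shift + rng.gauss(0.0, 0.5) for tau in true_tau])

# Individual basis: share of all (draw, visitor) samples with positive CATE.
individual = sum(x > 0 for row in draws for x in row) / (n_draws * n_visitors)

# Segment-mean basis: share of draws whose segment-average CATE is positive.
segment_mean = sum(sum(row) / n_visitors > 0 for row in draws) / n_draws
```

The individual basis stays in the mushy middle because a quarter of the segment is harmed, while the segment-mean basis comes out close to 1.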

### The cell model

At the operator surface a three-way split of traffic is first-class: each
round ships a list of `Cell` objects (weight + assignment `Policy`):

- **Control cell** (`BaselinePolicy`): no treatment — clean
  counterfactual baseline. Never falls below `min_control_weight`.
- **Explore cell** (`UniformPolicy`): randomized across treatments —
  clean RCT signal for HTE discovery. Never falls below
  `min_explore_weight`.
- **Optimized cell** (`TreePolicy`): policy-tree-routed treatment —
  value capture from learned segments.

The Optimized cell has full expressive power: different segments get
different treatments (not just treat/hold). Inside it, the Thompson
`epsilon` controls how close per-segment allocation can get to 0 or 1
for any arm — higher epsilon means propensities closer to uniform-K.
The cell weights are the operator-facing dial; epsilon is the
within-cell safety net. Operators can add their own hypothesis cells
alongside these three (see the
[injecting your own treatment hypotheses](../tutorials/injecting-your-own-treatment-hypotheses.md)
tutorial).
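
How the two floors bound the cell weights can be sketched as follows. `cell_weights` is a hypothetical helper, not part of pytyche, and the 0.10 floor values are placeholders rather than library defaults.

```python
def cell_weights(control: float, explore: float,
                 min_control_weight: float = 0.10,
                 min_explore_weight: float = 0.10) -> dict[str, float]:
    """Clamp control and explore to their floors; the optimized cell gets the rest."""
    control = max(control, min_control_weight)
    explore = max(explore, min_explore_weight)
    optimized = 1.0 - control - explore
    if optimized < 0:
        raise ValueError("control + explore exceed total traffic")
    return {"control": control, "explore": explore, "optimized": optimized}


# A decaying control request: 50% -> 30% -> 10%, then an attempt to go lower.
schedule = [cell_weights(control=c, explore=0.20) for c in (0.50, 0.30, 0.10, 0.02)]
```

The last request asks for 2% control but is held at the floor, so the optimized cell never absorbs the counterfactual baseline.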

## Metrics

| Metric | Source | What it answers |
|---|---|---|
| Expected lift | True p0/p1/m0/m1 | "How good is the policy, ignoring noise?" |
| Realized lift | Observed revenue | "What does a practitioner actually see?" |
| Oracle lift | Treat iff true_tau > 0 | "Upper bound with perfect information" |
| Uniform lift | ATE from 50/50 | "Baseline: what if we didn't target at all?" |
| Targeting regret | Oracle − realized | "How far from optimal per round?" |
| Cumulative regret | Sum across rounds | "Does learning converge?" (should flatten) |
| CATE RMSE | Per-visitor estimates | "How accurate are individual CATEs?" |
| GATE RMSE | Per-quartile group averages | "How accurate are group-level effects?" |
| Segment stability | Jaccard similarity | "Are segment boundaries converging?" |
| Segment count | Active segments per round | "How complex is the targeting rule?" |
| Policy accuracy | Fraction matching oracle | "What % of visitors get correct assignment?" |

**Expected vs. realized lift**: Expected lift isolates policy quality by
using true potential outcomes. Realized lift includes sampling noise. Both
should increase over rounds, with realized lift converging toward expected.

**GATE RMSE is the leading indicator for targeting quality**: if the
estimator correctly identifies which *segments* benefit vs. are harmed, the
policy tree will make good decisions regardless of individual CATE noise.
This is why Joint BCF's GATE RMSE superiority (7/8 scenarios in benchmark)
matters more than Proto's individual ranking advantage.
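
A toy illustration of why group averages can be accurate when individual estimates are not. Everything here is synthetic: a single feature, unbiased estimates with independent noise, and quartiles of that feature as the groups (an assumed grouping, chosen for simplicity).

```python
import math
import random

rng = random.Random(0)
n = 4000

# Synthetic truth: CATE varies smoothly with one feature x in [0, 1].
x = [rng.random() for _ in range(n)]
true_tau = [2.0 * xi - 1.0 for xi in x]
# Noisy but unbiased per-visitor estimates.
est_tau = [t + rng.gauss(0.0, 0.5) for t in true_tau]


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a))


# CATE RMSE: error of the individual estimates.
cate_rmse = rmse(est_tau, true_tau)

# GATE RMSE: sort visitors by x, cut into quartiles, compare group means.
order = sorted(range(n), key=lambda i: x[i])
groups = [order[q * n // 4:(q + 1) * n // 4] for q in range(4)]
gate_true = [sum(true_tau[i] for i in g) / len(g) for g in groups]
gate_est = [sum(est_tau[i] for i in g) / len(g) for g in groups]
gate_rmse = rmse(gate_est, gate_true)
```

Averaging within groups cancels the independent per-visitor noise, so `gate_rmse` comes out far below `cate_rmse`. That cancellation is what a segment-level policy relies on.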

## Multi-Arm Targeting

The multi-arm joint hurdle BCF ({doc}`multi-arm-hurdle-bcf`) fits a
single prognostic forest shared across all arms and a tau forest whose
leaves carry a `(K−1)` contrast vector — one contrast per treatment vs.
control. The joint posterior over contrasts is what makes the policy tree's
per-arm assignment calibrated: the best-arm rule is applied within each
correlated posterior draw, so winner's-curse bias is eliminated.

This matches real-world experimentation practice: "free shipping" vs.
"10% discount" vs. "bundle offer" — different visitor segments may respond
to different promotions. The policy tree produces rules like:

```
IF recency > 30 days:
  → low_promo ("show discount")
ELSE IF order_count > 5:
  → free_ship ("free shipping offer")
ELSE:
  → control (more data needed)
```

The per-round composition using the public L2 primitives:

```python
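# Assumption: pytyche is imported as `pt` (`import pytyche as pt`) and
# `observed` is the ObservedExperimentData from the round just completed.

# Round 1: uniform allocation across arms
# (control, low_promo, free_ship, ...)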

posterior = pt.fit(observed, pooling="joint")
# HurdleBCFResult; rpv_cate_samples: (n, S_total, K-1)

tree = posterior.fit_policy_tree(max_depth=3)
# PolicyTreeResult; tree.segments is the segment-to-arm mapping

allocation = posterior.thompson_allocation(segments=tree.segments)
# dict[leaf_id, dict[treatment_name, weight]]
# best_arm(δ) = 0 if max_j δ_j ≤ 0 else argmax_j δ_j + 1

# Round r: each segment → matched to its best arm
#          control fraction decays: 50% → 30% → 10%
```

**Progressive control decay**: early rounds keep heavy control allocation
for discovery. As estimates stabilize (measured by segment stability and
GATE RMSE convergence), control fraction decreases. Control shifts from
"discovery" to "monitoring" — a configurable schedule rather than a binary
switch.

The policy tree labels at K≥3 are computed by applying the shared best-arm
rule to each visitor's posterior-mean contrast vector; the sklearn tree is
then a multiclass classifier routing future visitors to arms. `arm_best_probabilities`
on each `DiscoveredSegment` carries the per-draw frequencies that drove
the Thompson allocation — useful for spotting ambivalent segments (near-tied
leaders) vs. decisive ones. See
[Working with the posterior §2–3](../tutorials/working-with-the-posterior.md)
for a worked example of both.

## Scenario Priority

From the 8 validated templates, these best showcase sequential targeting:

1. **reversal** — CATE flips sign across visitor features. 50/50
   benefiters vs. harmed. Perfect targeting = 2x uniform lift. Best
   showcase: the policy tree must correctly separate groups.

2. **sparse_benefit** — only ~15% of visitors benefit. Targeting
   concentrates resources on the minority. Tests selectivity.

3. **monotone_gradient** — smooth CATE gradient across feature space.
   Tests how policies sharpen segment boundaries over rounds.

4. **constant** — homogeneous treatment effect. Targeting should NOT help.
   Sanity check: cumulative regret ≈ 0, no spurious segment discovery.

## Benchmark Foundation

Results from the 8-scenario calibration run (March 2026, n=50k, Joint
hurdle BCF vs. Proto vs. ST-independent):

| Scenario | Joint GATE RMSE | Proto GATE RMSE | Winner |
|---|---|---|---|
| constant | 0.03 | 0.10 | Joint |
| reversal | 0.32 | 0.52 | Joint |
| sparse_benefit | 0.13 | 0.21 | Joint |
| monotone_gradient | 0.04 | 0.47 | Joint |
| nonlinear_interaction | 0.09 | 0.06 | Proto |
| partial_null | 0.09 | 0.22 | Joint |
| high_noise | 0.20 | 0.44 | Joint |
| clustered | 0.36 | 0.37 | Joint |

Joint BCF wins GATE RMSE in 7/8 scenarios. ST-independent is retired (wins 0/8).
Proto retains advantage in per-visitor ranking for some scenarios.

For sequential targeting, GATE RMSE is the relevant metric — it measures
how accurately the estimator identifies group-level effects, which directly
determines policy tree quality.

## Performance at Scale

The [canonical adaptive
tutorial](../tutorials/first-adaptive-experiment.md) doubles as the
reference benchmark: three rounds of 50k/100k/200k visitors — cumulative
fits at 50k, 150k, and 350k — with K = 3 arms at the library-default fit
configuration, on a single workstation GPU (Quadro RTX 5000, 16 GB).
Measured across repeated runs (GPU fits are not bit-reproducible, so
treat these as bands, not pins):

| Cumulative n | Round wall-clock | Peak GPU memory |
|---|---|---|
| 50k | ~2 min | ~0.7 GiB |
| 150k | ~4–4.5 min | ~1.0 GiB |
| 350k | ~8–10 min | ~2.3 GiB |

The whole three-round experiment runs end-to-end in roughly 15 minutes.
Round wall-clock covers the full per-round pipeline — the BCF fit
dominates; segment discovery, recommendation, and truth comparison are
seconds.

How that scales:

- **Below n ≈ 50k, wall-clock is flat.** Fixed costs (JIT compilation,
  chain warmup) dominate, so small fits all cost on the order of a
  couple of minutes regardless of n.
- **Beyond the flat region, round time grows roughly linearly with
  cumulative n** — about 1.1–1.6 s per 1,000 visitors at K = 3
  defaults on this card. The upper end of that band is not pure data
  volume: the loop deliberately grows the tau forest as evidence
  accumulates (the adaptive `num_trees_tau` sizing), so later rounds
  buy more model capacity along with more data.
- **Arms multiply cost mildly.** At n = 50k, a K = 3 fit costs ~1.4×
  the K = 2 paired fit (fit-only, same configuration, per-K
  compilation included).
- **Memory is not the binding constraint at these scales.** 350k
  cumulative visitors peak near 2.3 GiB — a 16 GB consumer card has
  headroom for cumulative fits well past a million visitors before
  memory planning is needed.
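
The scaling bullets amount to a simple cost model. A back-of-envelope sketch, assuming a flat cost of about 120 s (read off the 50k row of the table) and purely linear growth beyond 50k; `round_seconds` is an illustrative helper, not a pytyche function.

```python
def round_seconds(cumulative_n: int, per_1k_seconds: float,
                  flat_seconds: float = 120.0, flat_until: int = 50_000) -> float:
    """Flat fixed cost up to `flat_until`, then linear growth in cumulative n."""
    extra = max(0, cumulative_n - flat_until)
    return flat_seconds + per_1k_seconds * extra / 1_000


# Low and high ends of the 1.1-1.6 s per 1,000 visitors band at K = 3.
for n in (50_000, 150_000, 350_000):
    low, high = round_seconds(n, 1.1), round_seconds(n, 1.6)
    print(f"n={n:>7,}: {low / 60:.1f}-{high / 60:.1f} min")
```

At 150k and 350k this gives roughly 3.8 to 4.7 and 7.5 to 10 minutes, close to the measured bands in the table.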

## Pre-release Empirical Findings

These findings come from the pre-release research loop that validated
the approach (since superseded by `pt.sequential_experiment` as the one
public surface). The evidence stands; the tooling it names is
historical.

### Effect scale regime

`effect_scale` controls CATE magnitude in the scenario templates. At `es=1.0`
(the stress-test regime designed for BCF benchmarking), the signal-to-noise
ratio is unrealistically favorable — round 1→2 captures nearly all oracle
lift with no room for the flywheel to demonstrate progressive improvement.

At realistic e-commerce scales (`es=0.10–0.20`, `baseline_rate=0.03`), the
reversal template's heterogeneity vanishes because `effect_scale` currently
controls both the average treatment effect and heterogeneity amplitude via a
single multiplier on logit-scale slopes. **The clustered template is the
best base for targeting simulation** because its cluster structure preserves
directional heterogeneity (38% negative-τ visitors) across all effect scales.

**Open design issue:** `effect_scale` conflates main effect and heterogeneity
amplitude. A two-knob parameterization (separate `main_delta` and `hte_scale`)
would allow independent control. The clustered template naturally provides
some of this via its cluster structure, but the logit-additive templates
(reversal, monotone_gradient, interaction_only) need the fix.
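
A toy version of the issue, assuming a single feature x in [−1, 1], a logit-linear conversion model, and a baseline rate of 0.03. The functions and their default `main` and `slope` values are illustrative; they are not the template code.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def expit(z: float) -> float:
    return 1 / (1 + math.exp(-z))


BASELINE = 0.03  # baseline conversion rate


def tau_one_knob(x: float, effect_scale: float,
                 main: float = 0.2, slope: float = 1.0) -> float:
    """Conversion CATE when one multiplier scales both main effect and slope."""
    return expit(logit(BASELINE) + effect_scale * (main + slope * x)) - BASELINE


def tau_two_knob(x: float, main_delta: float, hte_scale: float,
                 slope: float = 1.0) -> float:
    """Same surface with independent average-effect and heterogeneity knobs."""
    return expit(logit(BASELINE) + main_delta + hte_scale * slope * x) - BASELINE


def spread(tau, **knobs) -> float:
    """Heterogeneity amplitude: CATE gap between visitors at x = +1 and x = -1."""
    return tau(1.0, **knobs) - tau(-1.0, **knobs)


stress = spread(tau_one_knob, effect_scale=1.0)       # large gap
realistic = spread(tau_one_knob, effect_scale=0.10)   # gap shrinks with the main effect
decoupled = spread(tau_two_knob, main_delta=0.02, hte_scale=1.0)  # small main effect, large gap
```

With one knob, dropping `effect_scale` from 1.0 to 0.10 shrinks the probability-scale gap by more than 10×. With two knobs, a small `main_delta` can coexist with a large gap.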

**Scenario catalog next step:** extend `MixtureConfig` to support richer
cluster structures for realistic targeting simulation:
- 4-6 clusters with distinct treatment response profiles (benefit, neutral,
  harmed — not just 2 groups)
- Per-cluster effect vectors (different conversion and severity responses)
- Separate `main_delta` (average treatment effect) and `hte_scale`
  (heterogeneity amplitude) knobs
- Effect magnitudes calibrated in probability space, not logit scale

The existing `interaction_only` and `nonlinear` templates already cover
multi-feature effect surfaces — the gap is in group structure complexity
and independent effect parameterization, not interaction patterns.

### Flywheel convergence (clustered, es=0.20, baseline_rate=0.03)

With fixed 10k visitors/round, depth 2:
- **Round 1**: uniform allocation, GATE RMSE 0.53, oracle gap 0.14
- **Round 3**: policy captures ~75% of oracle gap, GATE RMSE 0.09
- **Rounds 4-5**: stabilize, GATE RMSE ~0.05

The flywheel works: estimation improves → policy improves → data becomes
more informative. But depth-2 can't fully close the oracle gap (treats ~90%
when oracle says ~61% benefit) because 4 segments can't separate the
underlying cluster structure finely enough.

### Visitor scheduling

The batched-bandit literature (Perchet et al. 2016, arXiv 1505.00369;
Esfandiari et al. AAAI 2021; Che & Namkoong 2023, arXiv 2303.11582)
establishes that **O(log log T) batches** with geometric/doubling batch
sizes suffice to match a fully sequential bandit's regret — i.e., very
few rounds are theoretically necessary. In e-commerce, doubling visitors
means doubling round duration — feasible for optimization programs
spanning months.

Geometric scheduling (round k gets `n_base × 2^k` visitors) is
first-class at the public surface (`GeometricSchedule`). Its
theoretical advantage requires the base batch to be large enough for
the BCF to produce informative posteriors. At 3% baseline conversion,
`n_base < 5000` yields too few events for reliable CATE estimation:
under 50/50 assignment, fewer than ~75 treated conversions in the first
round (~30 at `n_base = 2000`).
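
The schedule and the event-count arithmetic can be sketched as follows. The helper names are illustrative, round indices are assumed to start at 0 (which matches the tutorial's 50k/100k/200k rounds), and the event count assumes 50/50 assignment at the baseline rate with no treatment lift.

```python
def geometric_schedule(n_base: int, rounds: int) -> list[int]:
    """Visitors per round: round k (0-indexed) gets n_base * 2**k."""
    return [n_base * 2 ** k for k in range(rounds)]


def expected_treated_conversions(n: int, baseline_rate: float = 0.03,
                                 treated_share: float = 0.5) -> float:
    """Expected conversions in the treated arm, ignoring any treatment lift."""
    return n * treated_share * baseline_rate


schedule = geometric_schedule(n_base=5_000, rounds=4)  # [5000, 10000, 20000, 40000]
events = {n: expected_treated_conversions(n) for n in (2_000, 5_000, 10_000)}
# events: 2,000 visitors -> 30, 5,000 -> 75, 10,000 -> 150 treated conversions
```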

### Adaptive tree depth (open question)

Deeper policy trees can capture finer heterogeneity but need more data to
avoid overfitting noisy CATE estimates. Sample-size-based depth rules
(`depth = log2(N / n_min_leaf)`) are too crude — they ignore estimation
quality and can produce overfit deep trees on noisy early-round CATEs.

The principled approach is **cross-validated policy value**: for each
candidate depth, evaluate held-out expected RPV under that policy. This
automatically adapts to signal strength rather than just data volume.
Not yet implemented.
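
A minimal sketch of what held-out policy-value comparison could look like, on synthetic 50/50 data with hand-written rules standing in for policy trees of different depths. The simulator, the candidate rules, and the function names are all illustrative; the estimator is the standard inverse-propensity (Horvitz-Thompson) policy value, not a pytyche API.

```python
import random

rng = random.Random(0)


def simulate(n: int):
    """Held-out 50/50 RCT: feature x in [0, 1]; treatment helps only when x > 0.5."""
    rows = []
    for _ in range(n):
        x = rng.random()
        treated = rng.random() < 0.5
        effect = 1.0 if x > 0.5 else -1.0
        revenue = 5.0 + (effect if treated else 0.0) + rng.gauss(0.0, 1.0)
        rows.append((x, treated, revenue))
    return rows


def policy_value(policy, rows, propensity: float = 0.5) -> float:
    """Inverse-propensity estimate of mean revenue per visitor under `policy`:
    keep visitors whose randomized assignment matches the policy, reweight by 1/p."""
    total = 0.0
    for x, treated, revenue in rows:
        if policy(x) == treated:
            total += revenue / propensity
    return total / len(rows)


holdout = simulate(20_000)
candidates = {
    "treat nobody": lambda x: False,
    "treat everybody": lambda x: True,
    "treat if x > 0.5": lambda x: x > 0.5,
}
values = {name: policy_value(p, holdout) for name, p in candidates.items()}
best = max(values, key=values.get)
```

The same comparison, run over trees of depth 1, 2, 3 and so on, would pick the depth whose held-out value is highest, which penalizes deep trees fit to noisy early-round CATEs.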

## Related concepts

- {doc}`multi-arm-hurdle-bcf` — the per-round estimator: joint posterior
  over K−1 treatment-vs-control contrasts, and how result shapes map to
  per-segment arm probabilities
- {doc}`result-objects` — the type hierarchy: `HurdleBCFResult`,
  `AnalysisResult`, and `Experiment`; the observed-stashing contract that
  lets analysis primitives reach the input data without it being passed
  again
- [Working with the posterior](../tutorials/working-with-the-posterior.md)
  — hands-on walkthrough of each analysis primitive the sequential engine
  composes: `fit_policy_tree`, `thompson_allocation`, `apply_calibration`,
  `recommendation_summary`
- {doc}`overview` — what pytyche does and who it's for
- {doc}`bcf-calibration-at-scale` — the benchmark results behind GATE RMSE
  comparisons cited above