Sequential Targeting via Segment-Based Enrichment¶
Design rationale for pytyche’s sequential surface: why the
HTE estimation → segment discovery → targeted allocation → re-estimation
flywheel compounds value over batched e-commerce experiments. The
operator surface for this loop is pt.sequential_experiment(...) — the
first adaptive experiment
tutorial walks it end-to-end; this doc explains why it’s shaped the way
it is.
Problem Statement¶
We have a validated HTE estimation pipeline: joint hurdle BCF recovers per-visitor CATEs decomposed into conversion and AOV channels. The 8-scenario benchmark (March 2026, n=50k) shows Joint BCF wins GATE RMSE 7/8 scenarios — meaning it correctly identifies which groups benefit vs. are harmed by treatment.
The missing proof: does acting on those estimates actually help? If we target treatment to estimated benefiters and withhold from estimated non-benefiters, does realized lift exceed uniform allocation? Does the improvement compound over sequential experiments as estimates sharpen?
Nobody has demonstrated this loop empirically for batched e-commerce experiments on zero-inflated revenue. The closest work operates in different settings (online bandits, single-shot policy evaluation, continuous outcomes).
Literature Positioning¶
What exists¶
Contextual bandits with HTE oracles (Carranza, Krishnamurthy, Athey): Connects contextual-bandit regret to HTE estimation quality — showing CATE oracles are more sample-efficient than full reward modeling. But: online/single-visitor updates, not batched experiments. Continuous outcomes, not zero-inflated.
Batched bandit theory (Perchet et al. 2016, arXiv 1505.00369; Esfandiari et al. AAAI 2021; Che & Namkoong 2023, arXiv 2303.11582): Establishes that O(log log T) batches suffice to match the regret of a fully sequential bandit, with geometric/doubling batch sizes. Che & Namkoong 2023 specifically targets adaptive experimentation at scale with flexible batches + delayed feedback. This is the theoretical anchor for pytyche’s geometric visitor scheduling.
R-learner / causal forests (Nie & Wager 2021, Wager & Athey 2018): Foundation for CATE estimation with theoretical guarantees (asymptotic normality, honesty). The R-learner’s residual-on-residual approach based on Frisch-Waugh-Lovell is elegant. But: single-experiment estimators, no sequential targeting loop. Continuous outcomes only.
Coarse personalization via segment compression (Zhang & Misra 2022, “Coarse Personalization,” arXiv 2204.05793, EC ‘24): A food-delivery promotions field experiment found that 5 discrete segments recover 99.5% of full personalization value — the policy tree is a near-lossless compression of the CATE surface. But: single experiment, off-the-shelf methods, continuous outcomes, no sequential loop, no hurdle decomposition. Domain-specific result; counter-evidence in Shchetkina 2024 (arXiv 2411.16552, withdrawn) reports 18% vs 4% personalization gains depending on population conditions, suggesting the 99.5% transfers cleanly to promo-like settings but isn’t universal.
Single-shot policy evaluation methodology (Hitsch, Misra & Zhang 2024, QME 22:115-168): Canonical marketing-science framing for off-policy evaluation of arbitrary targeting policies from a single RCT. Related to the above but addresses how to evaluate policy trees, not the empirical 99.5% finding.
Adaptive enrichment (clinical-trial designs): Adaptive-enrichment trial designs progressively narrow trial populations to responsive subgroups. Closest to our loop conceptually: estimate → select → re-estimate. But: very different statistical setting (survival/binary outcomes, small samples, phase II/III drugs).
The gap¶
| Setting | Prior work | Our contribution |
|---|---|---|
| Single-shot HTE + targeting | Zhang & Misra 2022; Hitsch/Misra/Zhang 2024 | none |
| Online sequential targeting | Carranza/Krishnamurthy/Athey | none |
| Batched sequential bandit theory | Perchet 2016 / Esfandiari 2021 / Che & Namkoong 2023 | none |
| Batched sequential HTE targeting | Gap | This work |
| Zero-inflated revenue outcomes | Gap | This work |
| Segment-based enrichment for e-commerce | Gap | This work |
The contribution is compositional: batched sequential targeting + hurdle BCF + segment-based enrichment + e-commerce application. No individual component is novel, but the combination and its empirical demonstration on zero-inflated revenue have not been done before.
Approach: Segment-Based Enrichment¶
Why not per-visitor propensities?¶
The straightforward approach is: estimate τ̂(x) per visitor → set treatment probability proportional to estimated benefit. This is theoretically optimal (maximizes expected welfare) but practically problematic:
Instability: individual CATE estimates are noisy; small changes in estimation produce large swings in per-visitor propensities across rounds, making the system erratic.
Operational opacity: a 50k-row propensity table is impossible for a merchandising team to act on. You can’t brief stakeholders on “visitor #34,721 gets p=0.73.”
Statistical complexity: non-uniform propensities require inverse probability weighting for valid estimation, introducing variance inflation and coverage issues.
Scope limitation: per-visitor propensities only work for real-time storefront personalization. They can’t drive cross-vertical business decisions (“should we also change email targeting for this segment?”).
Segment-based enrichment model¶
Instead, we fit a shallow decision tree (depth 2-3) on the CATE estimates to produce human-interpretable segments. Each segment gets a binary decision: treat or control.
BCF ensemble (50 tau trees × 100+ MCMC draws)
│
│ posterior mean τ̂(x) per visitor
▼
Policy tree (single DecisionTreeRegressor, depth 2-3)
│
│ 4-8 segments with feature-based rules
▼
Segment allocation:
"benefit" segments → 50/50 uniform (active experimentation)
"harm" segments → 100% control (with ε exploration floor)
Within active segments, assignment is uniform 50/50. This preserves maximum statistical power per observation and avoids propensity correction. The “targeting” is in segment selection, not assignment biasing.
This is essentially adaptive enrichment — the clinical-trial design pattern adapted to e-commerce.
The information bottleneck¶
Individual BCF tau trees are mostly stumps (α_tau=0.25, β_tau=3.0 means only 25% chance of splitting at root, 3% at depth 1). But the ensemble of 50 trees × 100+ MCMC draws captures a complex CATE surface through additive composition.
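The 25% and 3% figures are consistent with the standard BART depth prior, in which a node at depth d splits with probability α(1 + d)^(−β). The sketch below assumes that form; the function name is illustrative and not part of pytyche.

```python
def split_probability(depth: int, alpha: float = 0.25, beta: float = 3.0) -> float:
    """Prior probability that a node at `depth` splits (assumed BART depth prior)."""
    return alpha * (1.0 + depth) ** (-beta)


# Root, depth 1, depth 2: the probability collapses quickly, so most trees are stumps.
for depth in range(3):
    print(depth, split_probability(depth))
```

With these hyperparameters a single tree rarely grows past its root, which is why the heterogeneity has to come from the sum over many trees and draws.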
The policy tree is a deliberate compression layer: “given everything BCF learned, what are the 4-8 most important groups?” This is the coarse-personalization insight from Zhang & Misra 2022 quantified — 5 segments recover 99.5% of full personalization value in their food-delivery promo experiment. The remaining 0.5% buys enormous operational simplicity. Domain-specific result; transfer to other settings (especially zero-inflated) hasn’t been directly studied.
Why this matters for practice¶
The segment framing is how real experimentation teams operate:
Stability over time: coarse groups (e.g., “mobile users with >3 prior purchases”) are robust to population drift. Per-visitor propensities shift with every new visitor.
Cross-vertical actionability: segments can drive decisions beyond the storefront — email campaigns, pricing changes, inventory allocation. “High-engagement desktop users benefit from our premium layout” is actionable across channels.
Progressive confidence: early rounds use heavy control allocation (discovery). As segment effects stabilize, control fraction decreases (monitoring). The transition from discovery to exploitation is a configurable schedule, not a binary switch.
Core Loop¶
The per-round estimator is the multi-arm joint hurdle BCF
(Multi-arm hurdle BCF): a single shared prognostic forest plus
a tau forest whose leaves carry a (K−1) contrast vector, yielding a
joint posterior over all treatment-vs-control contrasts per visitor.
The sequential engine calls the same public analysis primitives that
power users call directly — there is no private variant. For the type
hierarchy see Result objects; for a hands-on walkthrough of each
primitive see
Working with the posterior.
Round 1: uniform allocation across all arms
→ pt.fit(observed, pooling="joint")
returns HurdleBCFResult (per-visitor rpv_cate_samples)
→ posterior.fit_policy_tree(...)
fits policy tree on cumulative per-visitor CATEs
→ initial segment-to-arm mapping
Round r>1: classify new visitors into segments from round r-1
→ each segment → assigned to its best arm (or control)
→ "uncertain" segments: exploration allocation
→ observe outcomes
→ pt.fit(accumulated_observed, pooling="joint")
fits on ALL accumulated data → updated HurdleBCFResult
→ posterior.fit_policy_tree(...)
refit policy tree → updated segment-to-arm mapping
Final: converged segments with stable per-arm treatment rules
Per-round primitives in order:
Estimator fit: pt.fit(observed, pooling="joint") (equivalently fit_hurdle_bcf(observed, pooling="joint")) takes an ObservedExperimentData and returns a HurdleBCFResult. The engine accumulates all data across rounds before each fit; the result carries rpv_cate_samples: (n, S_total, K−1), one contrast column per treatment vs. control.
Segmentation: posterior.fit_policy_tree(max_depth=..., min_segment_share=...) fits a policy tree on the per-visitor posterior-mean CATE vectors, returning a PolicyTreeResult with segments: list[DiscoveredSegment]. Each segment carries stability_score (bootstrap replicability), gate_estimate, and arm_best_probabilities keyed by all variant names including control.
Allocation: posterior.thompson_allocation(segments, epsilon=...). Per-leaf allocation is computed from the fraction of posterior draws in which each arm is best under the shared best-arm rule: best_arm(δ) = 0 (control) if max_j δ_j ≤ 0, else argmax_j δ_j + 1. Control is a first-class winner: it takes a draw exactly when every contrast is non-positive. The epsilon kwarg is the internal Thompson safety-net floor (ε/K per active treatment). It is NOT the operator-facing controls-retention dial; controls retention at the L1 surface is min_control_weight / min_explore_weight on pt.sequential_experiment(...).
Calibration application: posterior.apply_calibration(calibration) returns a new posterior of the same type with the R(p) + scale-family correction applied. K=2 only in v0.2; K≥3 per-contrast calibration raises NotImplementedError until the per-contrast SBC machinery ships with the sequential-surface calibration work.
Ship / stop / continue: posterior.recommendation_summary(treatment, segment) returns a RecommendationSummary with decision, expected_loss_comparison, probability_positive, probability_better, probability_harmful, and expected_value_of_one_more_round. expected_value_of_one_more_round is the information-theoretic value of running one additional round at the same per-round n. Near-zero means the experiment has converged and additional data is unlikely to change the decision.
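The shared best-arm rule in the allocation step can be sketched in a few lines. This is a standalone illustration, not pytyche's implementation: the helper names are hypothetical, each draw is taken to be one segment-level (K−1) contrast vector, and the ε floor is deliberately left out because its exact mixing formula is not specified here.

```python
from collections import Counter
from typing import Sequence


def best_arm(delta: Sequence[float]) -> int:
    """Return 0 (control) if no contrast is positive, else the 1-based argmax."""
    top = max(range(len(delta)), key=lambda j: delta[j])
    return 0 if delta[top] <= 0 else top + 1


def arm_best_fractions(draws: Sequence[Sequence[float]]) -> list[float]:
    """Fraction of posterior draws won by each arm; index 0 is control."""
    n_arms = len(draws[0]) + 1
    wins = Counter(best_arm(d) for d in draws)
    return [wins[arm] / len(draws) for arm in range(n_arms)]


# Four toy posterior draws for one segment with K = 3 arms (two contrasts).
segment_draws = [(0.4, 0.1), (0.3, 0.5), (-0.2, -0.1), (0.6, 0.2)]
fractions = arm_best_fractions(segment_draws)
print(fractions)
```

Because the rule is applied within each draw, control wins only the draw where both contrasts are non-positive; it is never a fallback chosen after the fact.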
Key design decisions¶
Data accumulation: BCF is refit on all data across rounds, not just the latest round. This gives maximum sample size for estimation but requires tracking per-visitor propensities (which vary by round and segment membership). Within active segments, propensity = 0.5. Within dropped segments, propensity = ε/2.
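To see why the realized propensities have to be stored with the accumulated data, consider a simple self-normalized inverse-propensity (Hájek) estimate. This is a swapped-in textbook estimator used only for illustration, not the BCF fit; the Observation record and helper names are hypothetical, and ε = 0.10 is the default mentioned in this document.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    treated: bool
    revenue: float
    propensity: float  # probability this visitor had of being treated


def treatment_propensity(segment_active: bool, epsilon: float = 0.10) -> float:
    """0.5 inside active segments, epsilon / 2 inside dropped segments."""
    return 0.5 if segment_active else epsilon / 2


def hajek_effect(rows: list[Observation]) -> float:
    """Self-normalized IPW estimate of treated-minus-control mean revenue."""
    treated = [r for r in rows if r.treated]
    control = [r for r in rows if not r.treated]
    t_mean = (sum(r.revenue / r.propensity for r in treated)
              / sum(1 / r.propensity for r in treated))
    c_mean = (sum(r.revenue / (1 - r.propensity) for r in control)
              / sum(1 / (1 - r.propensity) for r in control))
    return t_mean - c_mean


p_active, p_dropped = treatment_propensity(True), treatment_propensity(False)
# Toy data: active segment has effect +2.0, dropped segment has effect -0.5.
rows = (
    [Observation(True, 3.0, p_active)] * 10 + [Observation(False, 1.0, p_active)] * 10
    + [Observation(True, 0.5, p_dropped)] * 1 + [Observation(False, 1.0, p_dropped)] * 19
)
weighted = hajek_effect(rows)
naive = (sum(r.revenue for r in rows if r.treated) / 11
         - sum(r.revenue for r in rows if not r.treated) / 29)
print(weighted, naive)
```

In this toy data the weighted estimate recovers the population-average effect of the two equal-sized segments, while the unweighted difference in means is pulled toward the heavily treated active segment.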
Assignment through the generator’s own hook: generate_v2_core
decomposes into sample_features → compute_potential_outcomes → assign_and_observe → build_bundle. The sim-mode adapter computes
policy-routed assignments from the round plan’s cells and drives
assign_and_observe through its external treatment_assignment hook —
one generation path, with the realized per-visitor assignment
propensities recorded alongside the data.
Epsilon exploration floor: “dropped” segments still get a small ε
fraction treated. This allows detecting if the treatment effect changed
(e.g., the population shifted, or a product change made the treatment
beneficial for a previously-harmed group). ε=0.10 as default, sweepable.
This is the ε passed to posterior.thompson_allocation(epsilon=...) —
the within-Thompson safety net. The cell-level controls-retention floor
(the L1 operator dial) is separate.
Segment discovery vs. bandit optimization¶
This system is not a contextual bandit. The goal is not “optimal per-visitor allocation to maximize cumulative reward.” The goal is: discover stable, interpretable population segments with consistent treatment responses, then evolve the policy tree that routes them.
| | Contextual bandit | Segment discovery |
|---|---|---|
| Unit | individual visitor | segment (country × device × engagement) |
| Output | per-visitor propensity (opaque) | policy tree rules (interpretable) |
| Converges to | optimal allocation function | stable segment definitions |
| Actionable by | real-time personalization engine | merchandising team, cross-channel |
The policy tree’s job is to produce rules like “high-engagement mobile users in DE respond to premium layout” — actionable across storefront, email, pricing. Per-visitor propensities can’t drive these decisions.
This distinction determines the allocation basis (see below).
Allocation basis: segment-mean vs. individual¶
The allocation probability P(τ>0) can be computed two ways:
Segment-mean: P(segment-mean CATE > 0) across posterior draws. “Is treating this segment as a group net-positive?”
Individual: fraction of individual posterior samples > 0. “What share of visitors in this segment have positive CATE?”
For segment discovery, segment-mean is correct. A segment where 75% of visitors benefit (individual P(τ>0) ≈ 0.75) but the segment-mean CATE is clearly positive (segment-mean P(τ>0) ≈ 0.95) should be treated decisively — the within-segment heterogeneity is resolved by refining the tree (splitting further), not by setting allocation to 0.75.
The individual-fraction approach produces the “knows but won’t commit” problem: segments with large true effects and narrow CIs still get mushy p_treat values (0.6-0.8) because the segment is internally mixed. Segment-mean pushes these to near 0 or 1, matching the binary treat/hold decision the system is designed to make.
The individual-fraction approach was retained as a research baseline during validation; segment-mean is what ships.
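The difference between the two bases is easy to reproduce on toy posterior draws. The sketch below is an illustration with made-up numbers (a segment of 100 visitors, 75 of them with a positive true effect, Gaussian noise on each draw); the function names are hypothetical.

```python
import random

rng = random.Random(0)
n_draws = 200
# Hypothetical segment: 75 visitors with true effect +1.0 and 25 with -1.0.
true_effects = [1.0] * 75 + [-1.0] * 25
# draws[s][i] is posterior draw s of visitor i's CATE (toy noise, sd 0.3).
draws = [[tau + rng.gauss(0.0, 0.3) for tau in true_effects] for _ in range(n_draws)]


def individual_fraction(draws: list[list[float]]) -> float:
    """Share of all visitor-level posterior samples that are positive."""
    flat = [value for row in draws for value in row]
    return sum(value > 0 for value in flat) / len(flat)


def segment_mean_probability(draws: list[list[float]]) -> float:
    """Share of draws in which the segment-average CATE is positive."""
    return sum(sum(row) / len(row) > 0 for row in draws) / len(draws)


p_individual = individual_fraction(draws)
p_segment = segment_mean_probability(draws)
print(p_individual, p_segment)
```

The individual basis stays near the share of benefiting visitors, while the segment-mean basis is decisive because the average effect is far from zero relative to its posterior spread.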
The cell model¶
At the operator surface this three-way split is first-class: each round
ships a list of Cell objects (weight + assignment Policy):
Control cell (BaselinePolicy): no treatment, giving a clean counterfactual baseline. Never falls below min_control_weight.
Explore cell (UniformPolicy): randomized across treatments, giving a clean RCT signal for HTE discovery. Never falls below min_explore_weight.
Optimized cell (TreePolicy): policy-tree-routed treatment, capturing value from learned segments.
The Optimized cell has full expressive power: different segments get
different treatments (not just treat/hold). Inside it, the Thompson
epsilon controls how close per-segment allocation can get to 0 or 1
for any arm — higher epsilon means propensities closer to uniform-K.
The cell weights are the operator-facing dial; epsilon is the
within-cell safety net. Operators can add their own hypothesis cells
alongside these three (see the
injecting your own treatment hypotheses
tutorial).
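One way to picture the two floors is as a cap on how much weight the Optimized cell may take. The function below is a hypothetical sketch of that arithmetic, not pytyche's scheduling logic; the floor values in the example and the even split of the remaining weight are assumptions made for illustration.

```python
def plan_cell_weights(optimized_target: float,
                      min_control_weight: float,
                      min_explore_weight: float) -> dict[str, float]:
    """Cap the optimized cell so the control and explore floors always hold."""
    if min_control_weight + min_explore_weight > 1.0:
        raise ValueError("floors exceed the total weight")
    optimized = min(max(optimized_target, 0.0),
                    1.0 - min_control_weight - min_explore_weight)
    remaining = 1.0 - optimized
    # Split the rest evenly, then shift weight so neither floor is violated.
    control = min(max(min_control_weight, remaining / 2),
                  remaining - min_explore_weight)
    explore = remaining - control
    return {"control": control, "explore": explore, "optimized": optimized}


# Asking for 95% optimized traffic is clipped by the two floors.
weights = plan_cell_weights(0.95, min_control_weight=0.10, min_explore_weight=0.05)
print(weights)
```

However aggressive the requested optimized share, the baseline and the randomized signal keep their minimum traffic.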
Metrics¶
| Metric | Source | What it answers |
|---|---|---|
| Expected lift | True p0/p1/m0/m1 | “How good is the policy, ignoring noise?” |
| Realized lift | Observed revenue | “What does a practitioner actually see?” |
| Oracle lift | Treat iff true_tau > 0 | “Upper bound with perfect information” |
| Uniform lift | ATE from 50/50 | “Baseline: what if we didn’t target at all?” |
| Targeting regret | Oracle − realized | “How far from optimal per round?” |
| Cumulative regret | Sum across rounds | “Does learning converge?” (should flatten) |
| CATE RMSE | Per-visitor estimates | “How accurate are individual CATEs?” |
| GATE RMSE | Per-quartile group averages | “How accurate are group-level effects?” |
| Segment stability | Jaccard similarity | “Are segment boundaries converging?” |
| Segment count | Active segments per round | “How complex is the targeting rule?” |
| Policy accuracy | Fraction matching oracle | “What % of visitors get correct assignment?” |
Expected vs. realized lift: Expected lift isolates policy quality by using true potential outcomes. Realized lift includes sampling noise. Both should increase over rounds, with realized lift converging toward expected.
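A minimal sketch of the truth-based metrics, assuming the hurdle decomposition (revenue per visitor = conversion probability × mean order value) and defining lift as the mean gain over an all-control policy. The Visitor record and function names are hypothetical, and the exact definitions used by the library may differ.

```python
from dataclasses import dataclass


@dataclass
class Visitor:
    p0: float  # conversion probability under control
    p1: float  # conversion probability under treatment
    m0: float  # mean order value given conversion, control
    m1: float  # mean order value given conversion, treatment

    @property
    def tau(self) -> float:
        """True revenue-per-visitor effect: p1 * m1 - p0 * m0."""
        return self.p1 * self.m1 - self.p0 * self.m0


def expected_lift(visitors: list[Visitor], treat: list[bool]) -> float:
    """Mean true gain over all-control when visitor i is treated iff treat[i]."""
    return sum(v.tau for v, t in zip(visitors, treat) if t) / len(visitors)


def oracle_lift(visitors: list[Visitor]) -> float:
    """Lift of the policy that treats exactly the visitors with tau > 0."""
    return sum(max(v.tau, 0.0) for v in visitors) / len(visitors)


def policy_accuracy(visitors: list[Visitor], treat: list[bool]) -> float:
    """Fraction of visitors whose assignment matches the oracle."""
    return sum(t == (v.tau > 0) for v, t in zip(visitors, treat)) / len(visitors)


visitors = [
    Visitor(0.03, 0.04, 50.0, 50.0),   # conversion gain
    Visitor(0.03, 0.02, 50.0, 50.0),   # conversion loss
    Visitor(0.03, 0.03, 50.0, 60.0),   # order-value gain
    Visitor(0.03, 0.035, 50.0, 40.0),  # conversion gain outweighed by order-value loss
]
policy = [True, False, False, False]
lift = expected_lift(visitors, policy)
regret = oracle_lift(visitors) - lift
accuracy = policy_accuracy(visitors, policy)
print(lift, regret, accuracy)
```

The fourth visitor shows why the two channels have to be combined before deciding: conversion rises, but revenue per visitor falls.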
GATE RMSE is the leading indicator for targeting quality: if the estimator correctly identifies which segments benefit vs. are harmed, the policy tree will make good decisions regardless of individual CATE noise. This is why Joint BCF’s GATE RMSE superiority (7/8 scenarios in benchmark) matters more than Proto’s individual ranking advantage.
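The contrast between the two error metrics can be shown on a toy example. The sketch assumes groups are formed as equal-size quantile bins of the true effect, which is one reading of "per-quartile group averages"; the function names are illustrative.

```python
import math


def cate_rmse(true_tau: list[float], est_tau: list[float]) -> float:
    """RMSE of per-visitor effect estimates."""
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true_tau, est_tau)) / len(true_tau))


def gate_rmse(true_tau: list[float], est_tau: list[float], n_groups: int = 4) -> float:
    """RMSE between true and estimated group-average effects over quantile groups."""
    order = sorted(range(len(true_tau)), key=lambda i: true_tau[i])
    size = len(order) // n_groups
    squared_errors = []
    for g in range(n_groups):
        # The last group absorbs any remainder.
        idx = order[g * size:] if g == n_groups - 1 else order[g * size:(g + 1) * size]
        true_gate = sum(true_tau[i] for i in idx) / len(idx)
        est_gate = sum(est_tau[i] for i in idx) / len(idx)
        squared_errors.append((est_gate - true_gate) ** 2)
    return math.sqrt(sum(squared_errors) / n_groups)


true_tau = [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
# Each estimate is off by 0.5, but the errors cancel within every group.
est_tau = [-1.5, -0.5, 0.5, -0.5, 0.5, 1.5, 2.5, 1.5]
print(cate_rmse(true_tau, est_tau), gate_rmse(true_tau, est_tau))
```

Here the per-visitor estimates are uniformly noisy, yet the group averages are exact, which is the situation in which a segment-level policy still makes the right calls.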
Multi-Arm Targeting¶
The multi-arm joint hurdle BCF (Multi-arm hurdle BCF) fits a
single prognostic forest shared across all arms and a tau forest whose
leaves carry a (K−1) contrast vector — one contrast per treatment vs.
control. The joint posterior over contrasts is what makes the policy tree’s
per-arm assignment calibrated: the best-arm rule is applied within each
correlated posterior draw, so winner’s-curse bias is eliminated.
This matches real-world experimentation practice: “free shipping” vs. “10% discount” vs. “bundle offer” — different visitor segments may respond to different promotions. The policy tree produces rules like:
IF recency > 30 days:
→ low_promo ("show discount")
ELSE IF order_count > 5:
→ free_ship ("free shipping offer")
ELSE:
→ control (more data needed)
The per-round composition using the public L2 primitives:
# Round 1: uniform allocation across arms
# (control, low_promo, free_ship, ...)
posterior = pt.fit(observed, pooling="joint")
# HurdleBCFResult; rpv_cate_samples: (n, S_total, K-1)
tree = posterior.fit_policy_tree(max_depth=3)
# PolicyTreeResult; tree.segments is the segment-to-arm mapping
allocation = posterior.thompson_allocation(segments=tree.segments)
# dict[leaf_id, dict[treatment_name, weight]]
# best_arm(δ) = 0 if max_j δ_j ≤ 0 else argmax_j δ_j + 1
# Round r: each segment → matched to its best arm
# control fraction decays: 50% → 30% → 10%
Progressive control decay: early rounds keep heavy control allocation for discovery. As estimates stabilize (measured by segment stability and GATE RMSE convergence), control fraction decreases. Control shifts from “discovery” to “monitoring” — a configurable schedule rather than a binary switch.
The policy tree labels at K≥3 are computed by applying the shared best-arm
rule to each visitor’s posterior-mean contrast vector; the sklearn tree is
then a multiclass classifier routing future visitors to arms. arm_best_probabilities
on each DiscoveredSegment carries the per-draw frequencies that drove
the Thompson allocation — useful for spotting ambivalent segments (near-tied
leaders) vs. decisive ones. See
Working with the posterior §2–3
for a worked example of both.
Scenario Priority¶
From the 8 validated templates, these best showcase sequential targeting:
reversal — CATE flips sign across visitor features. 50/50 benefiters vs. harmed. Perfect targeting = 2x uniform lift. Best showcase: the policy tree must correctly separate groups.
sparse_benefit — only ~15% of visitors benefit. Targeting concentrates resources on the minority. Tests selectivity.
monotone_gradient — smooth CATE gradient across feature space. Tests how policies sharpen segment boundaries over rounds.
constant — homogeneous treatment effect. Targeting should NOT help. Sanity check: cumulative regret ≈ 0, no spurious segment discovery.
Benchmark Foundation¶
Results from the 8-scenario calibration run (March 2026, n=50k, Joint hurdle BCF vs. Proto vs. ST-independent):
| Scenario | Joint GATE RMSE | Proto GATE RMSE | Winner |
|---|---|---|---|
| constant | 0.03 | 0.10 | Joint |
| reversal | 0.32 | 0.52 | Joint |
| sparse_benefit | 0.13 | 0.21 | Joint |
| monotone_gradient | 0.04 | 0.47 | Joint |
| nonlinear_interaction | 0.09 | 0.06 | Proto |
| partial_null | 0.09 | 0.22 | Joint |
| high_noise | 0.20 | 0.44 | Joint |
| clustered | 0.36 | 0.37 | Joint |
Joint BCF wins on GATE RMSE in 7 of 8 scenarios. ST-independent is retired (it wins 0/8). Proto retains an advantage in per-visitor ranking for some scenarios.
For sequential targeting, GATE RMSE is the relevant metric — it measures how accurately the estimator identifies group-level effects, which directly determines policy tree quality.
Performance at Scale¶
The canonical adaptive tutorial doubles as the reference benchmark: three rounds of 50k/100k/200k visitors — cumulative fits at 50k, 150k, and 350k — with K = 3 arms at the library-default fit configuration, on a single workstation GPU (Quadro RTX 5000, 16 GB). Measured across repeated runs (GPU fits are not bit-reproducible, so treat these as bands, not pins):
| Cumulative n | Round wall-clock | Peak GPU memory |
|---|---|---|
| 50k | ~2 min | ~0.7 GiB |
| 150k | ~4–4.5 min | ~1.0 GiB |
| 350k | ~8–10 min | ~2.3 GiB |
The whole three-round experiment runs end-to-end in roughly 15 minutes. Round wall-clock covers the full per-round pipeline — the BCF fit dominates; segment discovery, recommendation, and truth comparison are seconds.
How that scales:
Below n ≈ 50k, wall-clock is flat. Fixed costs (JIT compilation, chain warmup) dominate, so small fits all cost on the order of a couple of minutes regardless of n.
Beyond the flat region, round time grows roughly linearly with cumulative n, at about 1.1–1.6 s per 1,000 visitors at K = 3 defaults on this card. The upper end of that band is not pure data volume: the loop deliberately grows the tau forest as evidence accumulates (the adaptive num_trees_tau sizing), so later rounds buy more model capacity along with more data.
Arms multiply cost mildly. At n = 50k, a K = 3 fit costs ~1.4× the K = 2 paired fit (fit-only, same configuration, per-K compilation included).
Memory is not the binding constraint at these scales. 350k cumulative visitors peak near 2.3 GiB, so a 16 GB card has headroom for cumulative fits well past a million visitors before memory planning is needed.
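For rough planning, the scaling notes above can be folded into a back-of-envelope estimator. This sketch assumes a flat cost of about 2 minutes up to 50k visitors and reads the 1.1–1.6 s per 1,000 visitors rate as applying to visitors beyond that flat region; both are interpretations of the measurements, valid at best for K = 3 defaults on the benchmarked card.

```python
def estimate_round_seconds(cumulative_n: int,
                           flat_n: int = 50_000,
                           flat_seconds: float = 120.0,
                           per_thousand: tuple[float, float] = (1.1, 1.6)) -> tuple[float, float]:
    """Return a (low, high) wall-clock band in seconds for one round."""
    extra_thousands = max(0, cumulative_n - flat_n) / 1000
    low_rate, high_rate = per_thousand
    return (flat_seconds + low_rate * extra_thousands,
            flat_seconds + high_rate * extra_thousands)


for n in (50_000, 150_000, 350_000):
    low, high = estimate_round_seconds(n)
    print(f"{n:>7}: {low / 60:.1f} to {high / 60:.1f} min")
```

Under these assumptions the bands come out at roughly 3.8 to 4.7 minutes for 150k and 7.5 to 10 minutes for 350k, in line with the measured table. Treat the output as a planning band, not a prediction.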
Pre-release Empirical Findings¶
These findings come from the pre-release research loop that validated
the approach (since superseded by pt.sequential_experiment as the one
public surface). The evidence stands; the tooling it names is
historical.
Effect scale regime¶
effect_scale controls CATE magnitude in the scenario templates. At es=1.0
(the stress-test regime designed for BCF benchmarking), the signal-to-noise
ratio is unrealistically favorable — round 1→2 captures nearly all oracle
lift with no room for the flywheel to demonstrate progressive improvement.
At realistic e-commerce scales (es=0.10–0.20, baseline_rate=0.03), the
reversal template’s heterogeneity vanishes because effect_scale currently
controls both the average treatment effect and heterogeneity amplitude via a
single multiplier on logit-scale slopes. The clustered template is the
best base for targeting simulation because its cluster structure preserves
directional heterogeneity (38% negative-τ visitors) across all effect scales.
Open design issue: effect_scale conflates main effect and heterogeneity
amplitude. A two-knob parameterization (separate main_delta and hte_scale)
would allow independent control. The clustered template naturally provides
some of this via its cluster structure, but the logit-additive templates
(reversal, monotone_gradient, interaction_only) need the fix.
Scenario catalog next step: extend MixtureConfig to support richer
cluster structures for realistic targeting simulation:
4-6 clusters with distinct treatment response profiles (benefit, neutral, harmed — not just 2 groups)
Per-cluster effect vectors (different conversion and severity responses)
Separate main_delta (average treatment effect) and hte_scale (heterogeneity amplitude) knobs
Effect magnitudes calibrated in probability space, not logit scale
The existing interaction_only and nonlinear templates already cover multi-feature effect surfaces; the gap is in group structure complexity and independent effect parameterization, not interaction patterns.
Flywheel convergence (clustered, es=0.20, baseline_rate=0.03)¶
With fixed 10k visitors/round, depth 2:
Round 1: uniform allocation, GATE RMSE 0.53, oracle gap 0.14
Round 3: policy captures ~75% of oracle gap, GATE RMSE 0.09
Rounds 4-5: stabilize, GATE RMSE ~0.05
The flywheel works: estimation improves → policy improves → data becomes more informative. But depth-2 can’t fully close the oracle gap (treats ~90% when oracle says ~61% benefit) because 4 segments can’t separate the underlying cluster structure finely enough.
Visitor scheduling¶
The batched-bandit literature (Perchet et al. 2016, arXiv 1505.00369; Esfandiari et al. AAAI 2021; Che & Namkoong 2023, arXiv 2303.11582) establishes that O(log log T) batches with geometric/doubling batch sizes suffice to match a fully sequential bandit’s regret, i.e., very few rounds are theoretically necessary. In e-commerce, doubling visitors means doubling round duration, which is feasible for optimization programs spanning months.
Geometric scheduling (round k gets n_base × 2^k visitors) is
first-class at the public surface (GeometricSchedule). Its
theoretical advantage requires the base batch to be large enough for
the BCF to produce informative posteriors. At 3% baseline conversion,
n_base < 5000 yields too few events (fewer than ~75 treated conversions under a 50/50 split) for
reliable CATE estimation.
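The schedule arithmetic and the event-count check can be sketched as follows. This is a standalone illustration, not the GeometricSchedule API; the function names are hypothetical, and the conversion count assumes a 50/50 split with the treated arm converting at the baseline rate.

```python
from itertools import accumulate


def geometric_schedule(n_base: int, rounds: int) -> list[int]:
    """Round k (0-indexed) gets n_base * 2**k visitors."""
    return [n_base * 2 ** k for k in range(rounds)]


def expected_treated_conversions(n_visitors: int,
                                 baseline_rate: float = 0.03,
                                 treated_share: float = 0.5) -> float:
    """Expected converting visitors in the treated arm at the baseline rate."""
    return n_visitors * treated_share * baseline_rate


batches = geometric_schedule(50_000, 3)
cumulative = list(accumulate(batches))
first_round_events = expected_treated_conversions(batches[0])
print(batches, cumulative, first_round_events)
```

With n_base = 50,000 and three rounds this reproduces the 50k/100k/200k batches and the 50k/150k/350k cumulative fits used in the tutorial benchmark.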
Adaptive tree depth (open question)¶
Deeper policy trees can capture finer heterogeneity but need more data to
avoid overfitting noisy CATE estimates. Sample-size-based depth rules
(depth = log2(N / n_min_leaf)) are too crude — they ignore estimation
quality and can produce overfit deep trees on noisy early-round CATEs.
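The sample-size rule is simple enough to write down, which also makes its weakness visible. In this sketch the function name and the n_min_leaf value of 500 are hypothetical.

```python
import math


def depth_from_sample_size(n_total: int, n_min_leaf: int) -> int:
    """Largest depth whose 2**depth leaves could each hold n_min_leaf visitors."""
    if n_total < n_min_leaf:
        return 0
    return int(math.floor(math.log2(n_total / n_min_leaf)))


# The rule sees only N: a noisy first round and a well-estimated later round
# with the same cumulative N get the same depth.
print(depth_from_sample_size(10_000, 500), depth_from_sample_size(50_000, 500))
```

Nothing in the rule's inputs reflects how reliable the CATE estimates are, so it cannot back off to a shallower tree when the signal is weak.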
The principled approach is cross-validated policy value: for each candidate depth, evaluate held-out expected RPV under that policy. This automatically adapts to signal strength rather than just data volume. Not yet implemented.