Sequential Targeting via Segment-Based Enrichment

Design rationale for pytyche’s sequential surface: why the HTE estimation → segment discovery → targeted allocation → re-estimation flywheel compounds value over batched e-commerce experiments. The operator surface for this loop is pt.sequential_experiment(...) — the first adaptive experiment tutorial walks it end-to-end; this doc explains why it’s shaped the way it is.

Problem Statement

We have a validated HTE estimation pipeline: joint hurdle BCF recovers per-visitor CATEs decomposed into conversion and AOV channels. The 8-scenario benchmark (March 2026, n=50k) shows Joint BCF wins GATE RMSE 7/8 scenarios — meaning it correctly identifies which groups benefit vs. are harmed by treatment.

The missing proof: does acting on those estimates actually help? If we target treatment to estimated benefiters and withhold from estimated non-benefiters, does realized lift exceed uniform allocation? Does the improvement compound over sequential experiments as estimates sharpen?

Nobody has demonstrated this loop empirically for batched e-commerce experiments on zero-inflated revenue. The closest work operates in different settings (online bandits, single-shot policy evaluation, continuous outcomes).

Literature Positioning

What exists

Contextual bandits with HTE oracles (Carranza, Krishnamurthy, Athey): Connects contextual-bandit regret to HTE estimation quality — showing CATE oracles are more sample-efficient than full reward modeling. But: online/single-visitor updates, not batched experiments. Continuous outcomes, not zero-inflated.

Batched bandit theory (Perchet et al. 2016, arXiv 1505.00369; Esfandiari et al. AAAI 2021; Che & Namkoong 2023, arXiv 2303.11582): Establishes that O(log log T) batches suffice to match the regret of a fully sequential bandit, with geometric/doubling batch sizes. Che & Namkoong 2023 specifically targets adaptive experimentation at scale with flexible batches + delayed feedback. This is the theoretical anchor for pytyche’s geometric visitor scheduling.

R-learner / causal forests (Nie & Wager 2021, Wager & Athey 2018): Foundation for CATE estimation with theoretical guarantees (asymptotic normality, honesty). The R-learner’s residual-on-residual approach based on Frisch-Waugh-Lovell is elegant. But: single-experiment estimators, no sequential targeting loop. Continuous outcomes only.

Coarse personalization via segment compression (Zhang & Misra 2022, “Coarse Personalization,” arXiv 2204.05793, EC ‘24): A food-delivery promotions field experiment found that 5 discrete segments recover 99.5% of full personalization value — the policy tree is a near-lossless compression of the CATE surface. But: single experiment, off-the-shelf methods, continuous outcomes, no sequential loop, no hurdle decomposition. Domain-specific result; counter-evidence in Shchetkina 2024 (arXiv 2411.16552, withdrawn) reports 18% vs 4% personalization gains depending on population conditions, suggesting the 99.5% transfers cleanly to promo-like settings but isn’t universal.

Single-shot policy evaluation methodology (Hitsch, Misra & Zhang 2024, QME 22:115-168): Canonical marketing-science framing for off-policy evaluation of arbitrary targeting policies from a single RCT. Related to the above but addresses how to evaluate policy trees, not the empirical 99.5% finding.

Adaptive enrichment (clinical-trial designs): Adaptive-enrichment trial designs progressively narrow trial populations to responsive subgroups. Closest to our loop conceptually: estimate → select → re-estimate. But: very different statistical setting (survival/binary outcomes, small samples, phase II/III drugs).

The gap

| Setting | Prior work | Our contribution |
|---|---|---|
| Single-shot HTE + targeting | Zhang & Misra 2022; Hitsch/Misra/Zhang 2024 | |
| Online sequential targeting | Carranza/Krishnamurthy/Athey | |
| Batched sequential bandit theory | Perchet 2016 / Esfandiari 2021 / Che & Namkoong 2023 | |
| Batched sequential HTE targeting | Gap | This work |
| Zero-inflated revenue outcomes | Gap | This work |
| Segment-based enrichment for e-commerce | Gap | This work |

The contribution is compositional: batched sequential targeting + hurdle BCF + segment-based enrichment + e-commerce application. No individual component is novel, but the combination and empirical demonstration on zero-inflated revenue hasn’t been done.

Approach: Segment-Based Enrichment

Why not per-visitor propensities?

The straightforward approach is: estimate τ̂(x) per visitor → set treatment probability proportional to estimated benefit. This is theoretically optimal (maximizes expected welfare) but practically problematic:

  1. Instability: individual CATE estimates are noisy; small changes in estimation produce large swings in per-visitor propensities across rounds, making the system erratic.

  2. Operational opacity: a 50k-row propensity table is impossible for a merchandising team to act on. You can’t brief stakeholders on “visitor #34,721 gets p=0.73.”

  3. Statistical complexity: non-uniform propensities require inverse probability weighting for valid estimation, introducing variance inflation and coverage issues.

  4. Scope limitation: per-visitor propensities only work for real-time storefront personalization. They can’t drive cross-vertical business decisions (“should we also change email targeting for this segment?”).

Segment-based enrichment model

Instead, we fit a shallow decision tree (depth 2-3) on the CATE estimates to produce human-interpretable segments. Each segment gets a binary decision: treat or control.

BCF ensemble (50 tau trees × 100+ MCMC draws)
     │
     │  posterior mean τ̂(x) per visitor
     ▼
Policy tree (single DecisionTreeRegressor, depth 2-3)
     │
     │  4-8 segments with feature-based rules
     ▼
Segment allocation:
  "benefit" segments → 50/50 uniform (active experimentation)
  "harm" segments    → 100% control (with ε exploration floor)

Within active segments, assignment is uniform 50/50. This preserves maximum statistical power per observation and avoids propensity correction. The “targeting” is in segment selection, not assignment biasing.

This is essentially adaptive enrichment — the clinical-trial design pattern adapted to e-commerce.

The information bottleneck

Individual BCF tau trees are mostly stumps (α_tau=0.25, β_tau=3.0 means only 25% chance of splitting at root, 3% at depth 1). But the ensemble of 50 trees × 100+ MCMC draws captures a complex CATE surface through additive composition.
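The stump-heavy prior can be checked by hand. The sketch below assumes the standard BART depth prior, P(split at depth d) = α(1 + d)^(−β), which reproduces the 25% and 3% figures quoted above. Whether pytyche uses exactly this form is an assumption, and split_probability is a hypothetical helper, not a library function.

```python
def split_probability(depth: int, alpha: float = 0.25, beta: float = 3.0) -> float:
    """Prior probability that a node at `depth` splits: alpha * (1 + depth) ** -beta."""
    return alpha * (1.0 + depth) ** (-beta)


# Root, depth 1, depth 2: the probability of growing past a stump falls off fast.
for depth in range(3):
    print(f"depth {depth}: P(split) = {split_probability(depth):.5f}")
```

Because a tree only reaches depth 1 if the root splits, and each child then splits with probability of about 3%, most tau trees in any single draw stay at zero or one split.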

The policy tree is a deliberate compression layer: “given everything BCF learned, what are the 4-8 most important groups?” This is the coarse-personalization insight from Zhang & Misra 2022 quantified — 5 segments recover 99.5% of full personalization value in their food-delivery promo experiment. The remaining 0.5% buys enormous operational simplicity. Domain-specific result; transfer to other settings (especially zero-inflated) hasn’t been directly studied.

Why this matters for practice

The segment framing is how real experimentation teams operate:

  1. Stability over time: coarse groups (e.g., “mobile users with >3 prior purchases”) are robust to population drift. Per-visitor propensities shift with every new visitor.

  2. Cross-vertical actionability: segments can drive decisions beyond the storefront — email campaigns, pricing changes, inventory allocation. “High-engagement desktop users benefit from our premium layout” is actionable across channels.

  3. Progressive confidence: early rounds use heavy control allocation (discovery). As segment effects stabilize, control fraction decreases (monitoring). The transition from discovery to exploitation is a configurable schedule, not a binary switch.

Core Loop

The per-round estimator is the multi-arm joint hurdle BCF (see Multi-arm hurdle BCF): a single shared prognostic forest plus a tau forest whose leaves carry a (K−1) contrast vector, yielding a joint posterior over all treatment-vs-control contrasts per visitor. The sequential engine calls the same public analysis primitives that power users call directly. There is no private variant. For the type hierarchy see Result objects; for a hands-on walkthrough of each primitive see Working with the posterior.

Round 1: uniform allocation across all arms
         → pt.fit(observed, pooling="joint")
             returns HurdleBCFResult (per-visitor rpv_cate_samples)
         → posterior.fit_policy_tree(...)
             fits policy tree on cumulative per-visitor CATEs
             → initial segment-to-arm mapping

Round r>1: classify new visitors into segments from round r-1
           → each segment → assigned to its best arm (or control)
           → "uncertain" segments: exploration allocation
           → observe outcomes
           → pt.fit(accumulated_observed, pooling="joint")
               fits on ALL accumulated data → updated HurdleBCFResult
           → posterior.fit_policy_tree(...)
               refit policy tree → updated segment-to-arm mapping

Final:   converged segments with stable per-arm treatment rules

Per-round primitives in order:

  1. Estimator fit: pt.fit(observed, pooling="joint") (equivalently fit_hurdle_bcf(observed, pooling="joint")) takes an ObservedExperimentData and returns a HurdleBCFResult. The engine accumulates all data across rounds before each fit; the result carries rpv_cate_samples with shape (n, S_total, K−1), one contrast column per treatment vs. control.

  2. Segmentation: posterior.fit_policy_tree(max_depth=..., min_segment_share=...) fits a policy tree on the per-visitor posterior-mean CATE vectors, returning a PolicyTreeResult with segments: list[DiscoveredSegment]. Each segment carries stability_score (bootstrap replicability), gate_estimate, and arm_best_probabilities keyed by all variant names including control.

  3. Allocation: posterior.thompson_allocation(segments, epsilon=...). Per-leaf allocation is computed from the fraction of posterior draws in which each arm is best under the shared best-arm rule: best_arm(δ) = 0 (control) if max_j δ_j ≤ 0, else argmax_j δ_j + 1. Control is a first-class winner: it takes a draw exactly when every contrast is non-positive. The epsilon kwarg is the internal Thompson safety-net floor (ε/K per active treatment). It is NOT the operator-facing controls-retention dial; controls retention at the L1 surface is min_control_weight / min_explore_weight on pt.sequential_experiment(...).

  4. Calibration application: posterior.apply_calibration(calibration) returns a new posterior of the same type with the R(p) + scale-family correction applied. K=2 only in v0.2; K≥3 per-contrast calibration raises NotImplementedError until the per-contrast SBC machinery ships with the sequential-surface calibration work.

  5. Ship / stop / continue: posterior.recommendation_summary(treatment, segment) returns a RecommendationSummary with decision, expected_loss_comparison, probability_positive, probability_better, probability_harmful, and expected_value_of_one_more_round. expected_value_of_one_more_round is the information-theoretic value of running one additional round at the same per-round n. A near-zero value means the experiment has converged and additional data is unlikely to change the decision.
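The best-arm rule and the draw-frequency allocation in step 3 can be sketched in a few lines. This illustrates the rule as stated, not the library's implementation: thompson_weights is a hypothetical helper, and mixing the win frequencies with a uniform ε/K floor is one assumed reading of the epsilon safety net.

```python
from typing import List, Sequence


def best_arm(delta: Sequence[float]) -> int:
    """Return 0 (control) if no contrast is positive, else 1 + index of the largest contrast."""
    top = max(delta)
    return 0 if top <= 0 else list(delta).index(top) + 1


def thompson_weights(draws: Sequence[Sequence[float]], epsilon: float) -> List[float]:
    """Per-arm win frequency across draws, mixed with a uniform epsilon / K floor (assumed form)."""
    k = len(draws[0]) + 1  # K arms: control plus K-1 treatment contrasts
    wins = [0] * k
    for delta in draws:
        wins[best_arm(delta)] += 1
    return [(1 - epsilon) * w / len(draws) + epsilon / k for w in wins]


# Made-up segment-level contrast draws for K = 3 (two treatments vs. control).
draws = [(0.4, 0.1), (0.3, 0.5), (-0.2, -0.1), (0.6, 0.2)]
weights = thompson_weights(draws, epsilon=0.10)
print(weights)  # order: control, treatment 1, treatment 2
```

Under this reading, a K = 2 segment whose contrast is negative in every draw ends up with a treatment weight of ε/2, which matches the dropped-segment propensity described under Data accumulation.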

Key design decisions

Data accumulation: BCF is refit on all data across rounds, not just the latest round. This gives maximum sample size for estimation but requires tracking per-visitor propensities (which vary by round and segment membership). Within active segments, propensity = 0.5. Within dropped segments, propensity = ε/2.
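To make the bookkeeping concrete, here is a minimal sketch of recording the propensity in force at assignment time and then using it in an inverse-probability-weighted difference in means. The Visit record and both helpers are hypothetical, and the IPW estimator is a textbook stand-in that shows why the propensities must be kept; it is not a description of how the BCF fit consumes them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Visit:
    round_id: int
    treated: bool
    revenue: float
    propensity: float  # P(treated) in force when this visitor was assigned


def assignment_propensity(segment_active: bool, epsilon: float) -> float:
    """0.5 inside active segments, epsilon / 2 inside dropped segments."""
    return 0.5 if segment_active else epsilon / 2


def ipw_ate(visits: Sequence[Visit]) -> float:
    """Horvitz-Thompson style ATE estimate over data pooled across rounds."""
    total = 0.0
    for v in visits:
        if v.treated:
            total += v.revenue / v.propensity
        else:
            total -= v.revenue / (1.0 - v.propensity)
    return total / len(visits)


eps = 0.10
visits = [
    Visit(1, True, 10.0, assignment_propensity(True, eps)),
    Visit(1, False, 4.0, assignment_propensity(True, eps)),
    Visit(2, True, 20.0, assignment_propensity(False, eps)),   # rare treated visitor in a dropped segment
    Visit(2, False, 0.0, assignment_propensity(False, eps)),
]
print(ipw_ate(visits))
```

The treated visitor from the dropped segment carries a weight of 1 / 0.05 = 20, which is the variance inflation that segment-level 50/50 assignment is meant to limit.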

Assignment through the generator’s own hook: generate_v2_core decomposes into sample_features → compute_potential_outcomes → assign_and_observe → build_bundle. The sim-mode adapter computes policy-routed assignments from the round plan’s cells and drives assign_and_observe through its external treatment_assignment hook. This keeps a single generation path, with the realized per-visitor assignment propensities recorded alongside the data.

Epsilon exploration floor: “dropped” segments still get a small ε fraction treated. This allows detecting if the treatment effect changed (e.g., the population shifted, or a product change made the treatment beneficial for a previously-harmed group). ε=0.10 as default, sweepable. This is the ε passed to posterior.thompson_allocation(epsilon=...) — the within-Thompson safety net. The cell-level controls-retention floor (the L1 operator dial) is separate.

Segment discovery vs. bandit optimization

This system is not a contextual bandit. The goal is not “optimal per-visitor allocation to maximize cumulative reward.” The goal is: discover stable, interpretable population segments with consistent treatment responses, then evolve the policy tree that routes them.

| | Contextual bandit | Segment discovery |
|---|---|---|
| Unit | individual visitor | segment (country × device × engagement) |
| Output | per-visitor propensity (opaque) | policy tree rules (interpretable) |
| Converges to | optimal allocation function | stable segment definitions |
| Actionable by | real-time personalization engine | merchandising team, cross-channel |

The policy tree’s job is to produce rules like “high-engagement mobile users in DE respond to premium layout” — actionable across storefront, email, pricing. Per-visitor propensities can’t drive these decisions.

This distinction determines the allocation basis (see below).

Allocation basis: segment-mean vs. individual

The allocation probability P(τ>0) can be computed two ways:

  • Segment-mean: P(segment-mean CATE > 0) across posterior draws. “Is treating this segment as a group net-positive?”

  • Individual: fraction of individual posterior samples > 0. “What share of visitors in this segment have positive CATE?”

For segment discovery, segment-mean is correct. A segment where 75% of visitors benefit (individual P(τ>0) ≈ 0.75) but the segment-mean CATE is clearly positive (segment-mean P(τ>0) ≈ 0.95) should be treated decisively — the within-segment heterogeneity is resolved by refining the tree (splitting further), not by setting allocation to 0.75.

The individual-fraction approach produces the “knows but won’t commit” problem: segments with large true effects and narrow CIs still get mushy p_treat values (0.6-0.8) because the segment is internally mixed. Segment-mean pushes these to near 0 or 1, matching the binary treat/hold decision the system is designed to make.

The individual-fraction approach was retained as a research baseline during validation; segment-mean is what ships.
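The difference between the two bases is easy to see on synthetic draws. The sketch below builds a made-up segment in which 75% of visitors benefit, fakes posterior samples with a shared per-draw shift plus per-visitor noise, and computes both probabilities. All numbers are illustrative assumptions, not benchmark output.

```python
import random

random.seed(0)
n_visitors, n_draws = 200, 200

# Synthetic segment: 75% of visitors truly benefit (+1.0), 25% are harmed (-0.5).
true_tau = [1.0 if i < int(0.75 * n_visitors) else -0.5 for i in range(n_visitors)]

# Fake posterior draws: a shift shared by all visitors in a draw, plus per-visitor noise.
samples = []  # samples[s][i] = draw s of visitor i's CATE
for _ in range(n_draws):
    shift = random.gauss(0.0, 0.25)
    samples.append([tau + shift + random.gauss(0.0, 0.3) for tau in true_tau])

# Segment-mean basis: share of draws in which the segment's average CATE is positive.
p_segment_mean = sum(sum(draw) / n_visitors > 0 for draw in samples) / n_draws

# Individual basis: share of all (draw, visitor) samples that are positive.
p_individual = sum(x > 0 for draw in samples for x in draw) / (n_draws * n_visitors)

print(f"segment-mean P(tau > 0) = {p_segment_mean:.2f}")
print(f"individual   P(tau > 0) = {p_individual:.2f}")
```

The individual basis stays close to the share of benefiting visitors, while the segment-mean basis moves toward 1 because averaging over the segment removes the per-visitor noise.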

The cell model

At the operator surface this three-way split is first-class: each round ships a list of Cell objects (weight + assignment Policy):

  • Control cell (BaselinePolicy): no treatment — clean counterfactual baseline. Never falls below min_control_weight.

  • Explore cell (UniformPolicy): randomized across treatments — clean RCT signal for HTE discovery. Never falls below min_explore_weight.

  • Optimized cell (TreePolicy): policy-tree-routed treatment — value capture from learned segments.

The Optimized cell has full expressive power: different segments get different treatments (not just treat/hold). Inside it, the Thompson epsilon controls how close per-segment allocation can get to 0 or 1 for any arm — higher epsilon means propensities closer to uniform-K. The cell weights are the operator-facing dial; epsilon is the within-cell safety net. Operators can add their own hypothesis cells alongside these three (see the injecting your own treatment hypotheses tutorial).
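As a rough illustration of how the two floors interact with a decaying schedule, the sketch below clamps target weights for the control and explore cells and gives the remainder to the optimized cell. plan_cell_weights is a hypothetical helper using plain dicts in place of Cell objects, and the floor values are illustrative, not library defaults.

```python
def plan_cell_weights(control_target: float, explore_target: float,
                      min_control_weight: float, min_explore_weight: float) -> dict:
    """Clamp control/explore targets to their floors; the optimized cell takes what is left."""
    control = max(control_target, min_control_weight)
    explore = max(explore_target, min_explore_weight)
    optimized = 1.0 - control - explore
    if optimized < -1e-9:
        raise ValueError("control and explore weights exceed 1.0")
    return {"control": control, "explore": explore, "optimized": max(optimized, 0.0)}


# Made-up per-round targets: heavy discovery early, mostly value capture late.
targets = [(0.50, 0.50), (0.30, 0.30), (0.10, 0.20), (0.02, 0.05)]
for round_id, (c, e) in enumerate(targets, start=1):
    print(round_id, plan_cell_weights(c, e, min_control_weight=0.05, min_explore_weight=0.10))
```

In the last round the targets fall below both floors, so the control and explore cells hold at 0.05 and 0.10 and the optimized cell is capped at 0.85.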

Metrics

| Metric | Source | What it answers |
|---|---|---|
| Expected lift | True p0/p1/m0/m1 | “How good is the policy, ignoring noise?” |
| Realized lift | Observed revenue | “What does a practitioner actually see?” |
| Oracle lift | Treat iff true_tau > 0 | “Upper bound with perfect information” |
| Uniform lift | ATE from 50/50 | “Baseline: what if we didn’t target at all?” |
| Targeting regret | Oracle − realized | “How far from optimal per round?” |
| Cumulative regret | Sum across rounds | “Does learning converge?” (should flatten) |
| CATE RMSE | Per-visitor estimates | “How accurate are individual CATEs?” |
| GATE RMSE | Per-quartile group averages | “How accurate are group-level effects?” |
| Segment stability | Jaccard similarity | “Are segment boundaries converging?” |
| Segment count | Active segments per round | “How complex is the targeting rule?” |
| Policy accuracy | Fraction matching oracle | “What % of visitors get correct assignment?” |

Expected vs. realized lift: Expected lift isolates policy quality by using true potential outcomes. Realized lift includes sampling noise. Both should increase over rounds, with realized lift converging toward expected.

GATE RMSE is the leading indicator for targeting quality: if the estimator correctly identifies which segments benefit vs. are harmed, the policy tree will make good decisions regardless of individual CATE noise. This is why Joint BCF’s GATE RMSE superiority (7/8 scenarios in benchmark) matters more than Proto’s individual ranking advantage.
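The truth-based metrics can be sketched from the hurdle decomposition, taking revenue per visitor as conversion probability times mean order value. The definitions below are assumptions made for illustration: lift is measured against an all-control baseline, uniform lift is taken as the true ATE, and regret uses expected rather than realized lift. targeting_metrics is a hypothetical helper.

```python
from typing import Dict, Sequence


def targeting_metrics(p0: Sequence[float], p1: Sequence[float],
                      m0: Sequence[float], m1: Sequence[float],
                      policy: Sequence[int]) -> Dict[str, float]:
    """Noise-free policy metrics from true conversion rates (p) and order values (m)."""
    # True per-visitor RPV effect: treated p * m minus control p * m.
    tau = [p_trt * m_trt - p_ctl * m_ctl
           for p_ctl, p_trt, m_ctl, m_trt in zip(p0, p1, m0, m1)]
    n = len(tau)
    oracle = [1 if t > 0 else 0 for t in tau]  # treat iff true_tau > 0
    expected_lift = sum(t * d for t, d in zip(tau, policy)) / n
    oracle_lift = sum(t * d for t, d in zip(tau, oracle)) / n
    return {
        "expected_lift": expected_lift,
        "oracle_lift": oracle_lift,
        "uniform_lift": sum(tau) / n,           # true ATE (assumed definition)
        "regret": oracle_lift - expected_lift,  # expected-lift analogue of oracle minus realized
        "policy_accuracy": sum(d == o for d, o in zip(policy, oracle)) / n,
    }


# Four made-up visitors; the policy wrongly treats visitor 2.
metrics = targeting_metrics(
    p0=[0.03, 0.03, 0.03, 0.03], p1=[0.04, 0.02, 0.05, 0.03],
    m0=[50.0, 50.0, 50.0, 50.0], m1=[50.0, 50.0, 40.0, 50.0],
    policy=[1, 1, 1, 0],
)
print(metrics)
```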

Multi-Arm Targeting

The multi-arm joint hurdle BCF (see Multi-arm hurdle BCF) fits a single prognostic forest shared across all arms and a tau forest whose leaves carry a (K−1) contrast vector, one contrast per treatment vs. control. The joint posterior over contrasts is what makes the policy tree’s per-arm assignment calibrated: the best-arm rule is applied within each correlated posterior draw, so winner’s-curse bias is eliminated.

This matches real-world experimentation practice: “free shipping” vs. “10% discount” vs. “bundle offer” — different visitor segments may respond to different promotions. The policy tree produces rules like:

IF recency > 30 days:
  → low_promo ("show discount")
ELSE IF order_count > 5:
  → free_ship ("free shipping offer")
ELSE:
  → control (more data needed)

The per-round composition using the public L2 primitives:

# Round 1: uniform allocation across arms
# (control, low_promo, free_ship, ...)

posterior = pt.fit(observed, pooling="joint")
# HurdleBCFResult; rpv_cate_samples: (n, S_total, K-1)

tree = posterior.fit_policy_tree(max_depth=3)
# PolicyTreeResult; tree.segments is the segment-to-arm mapping

allocation = posterior.thompson_allocation(segments=tree.segments)
# dict[leaf_id, dict[treatment_name, weight]]
# best_arm(δ) = 0 if max_j δ_j ≤ 0 else argmax_j δ_j + 1

# Round r: each segment → matched to its best arm
#          control fraction decays: 50% → 30% → 10%

Progressive control decay: early rounds keep heavy control allocation for discovery. As estimates stabilize (measured by segment stability and GATE RMSE convergence), control fraction decreases. Control shifts from “discovery” to “monitoring” — a configurable schedule rather than a binary switch.

The policy tree labels at K≥3 are computed by applying the shared best-arm rule to each visitor’s posterior-mean contrast vector; the sklearn tree is then a multiclass classifier routing future visitors to arms. arm_best_probabilities on each DiscoveredSegment carries the per-draw frequencies that drove the Thompson allocation — useful for spotting ambivalent segments (near-tied leaders) vs. decisive ones. See Working with the posterior §2–3 for a worked example of both.

Scenario Priority

From the 8 validated templates, these best showcase sequential targeting:

  1. reversal — CATE flips sign across visitor features. 50/50 benefiters vs. harmed. Perfect targeting = 2x uniform lift. Best showcase: the policy tree must correctly separate groups.

  2. sparse_benefit — only ~15% of visitors benefit. Targeting concentrates resources on the minority. Tests selectivity.

  3. monotone_gradient — smooth CATE gradient across feature space. Tests how policies sharpen segment boundaries over rounds.

  4. constant — homogeneous treatment effect. Targeting should NOT help. Sanity check: cumulative regret ≈ 0, no spurious segment discovery.

Benchmark Foundation

Results from the 8-scenario calibration run (March 2026, n=50k, Joint hurdle BCF vs. Proto vs. ST-independent):

| Scenario | Joint GATE RMSE | Proto GATE RMSE | Winner |
|---|---|---|---|
| constant | 0.03 | 0.10 | Joint |
| reversal | 0.32 | 0.52 | Joint |
| sparse_benefit | 0.13 | 0.21 | Joint |
| monotone_gradient | 0.04 | 0.47 | Joint |
| nonlinear_interaction | 0.09 | 0.06 | Proto |
| partial_null | 0.09 | 0.22 | Joint |
| high_noise | 0.20 | 0.44 | Joint |
| clustered | 0.36 | 0.37 | Joint |

Joint BCF wins GATE RMSE in 7/8 scenarios. ST-independent is retired (wins 0/8). Proto retains an advantage in per-visitor ranking for some scenarios.

For sequential targeting, GATE RMSE is the relevant metric — it measures how accurately the estimator identifies group-level effects, which directly determines policy tree quality.

Performance at Scale

The canonical adaptive tutorial doubles as the reference benchmark: three rounds of 50k/100k/200k visitors — cumulative fits at 50k, 150k, and 350k — with K = 3 arms at the library-default fit configuration, on a single workstation GPU (Quadro RTX 5000, 16 GB). Measured across repeated runs (GPU fits are not bit-reproducible, so treat these as bands, not pins):

| Cumulative n | Round wall-clock | Peak GPU memory |
|---|---|---|
| 50k | ~2 min | ~0.7 GiB |
| 150k | ~4–4.5 min | ~1.0 GiB |
| 350k | ~8–10 min | ~2.3 GiB |

The whole three-round experiment runs end-to-end in roughly 15 minutes. Round wall-clock covers the full per-round pipeline — the BCF fit dominates; segment discovery, recommendation, and truth comparison are seconds.

How that scales:

  • Below n ≈ 50k, wall-clock is flat. Fixed costs (JIT compilation, chain warmup) dominate, so small fits all cost on the order of a couple of minutes regardless of n.

  • Beyond the flat region, round time grows roughly linearly with cumulative n — about 1.1–1.6 s per 1,000 visitors at K = 3 defaults on this card. The upper end of that band is not pure data volume: the loop deliberately grows the tau forest as evidence accumulates (the adaptive num_trees_tau sizing), so later rounds buy more model capacity along with more data.

  • Arms multiply cost mildly. At n = 50k, a K = 3 fit costs ~1.4× the K = 2 paired fit (fit-only, same configuration, per-K compilation included).

  • Memory is not the binding constraint at these scales. 350k cumulative visitors peak near 2.3 GiB — a 16 GB consumer card has headroom for cumulative fits well past a million visitors before memory planning is needed.
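For rough planning, the scaling notes above can be turned into a back-of-envelope estimate. The sketch assumes the 1.1–1.6 s per 1,000 visitors band is a marginal rate applied above the flat region, on top of the ~2 minutes measured at 50k. That reading, the constants, and round_seconds_band itself are assumptions, and the numbers only apply to the K = 3 default configuration on the card described above.

```python
FLAT_REGION_N = 50_000            # below this, fixed costs dominate
FLAT_REGION_SECONDS = 120.0       # ~2 min measured at 50k
RATE_BAND_S_PER_1K = (1.1, 1.6)   # quoted band, read here as a marginal rate


def round_seconds_band(cumulative_n: int) -> tuple:
    """Return a (low, high) wall-clock estimate in seconds for one round's fit."""
    extra_thousands = max(0, cumulative_n - FLAT_REGION_N) / 1000
    low_rate, high_rate = RATE_BAND_S_PER_1K
    return (FLAT_REGION_SECONDS + low_rate * extra_thousands,
            FLAT_REGION_SECONDS + high_rate * extra_thousands)


for n in (50_000, 150_000, 350_000):
    low, high = round_seconds_band(n)
    print(f"n={n}: {low / 60:.1f} to {high / 60:.1f} min")
```

Under that reading the estimate comes out at roughly 3.8 to 4.7 minutes for 150k and 7.5 to 10 minutes for 350k, in line with the measured bands in the table.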

Pre-release Empirical Findings

These findings come from the pre-release research loop that validated the approach (since superseded by pt.sequential_experiment as the one public surface). The evidence stands; the tooling it names is historical.

Effect scale regime

effect_scale controls CATE magnitude in the scenario templates. At es=1.0 (the stress-test regime designed for BCF benchmarking), the signal-to-noise ratio is unrealistically favorable — round 1→2 captures nearly all oracle lift with no room for the flywheel to demonstrate progressive improvement.

At realistic e-commerce scales (es=0.10–0.20, baseline_rate=0.03), the reversal template’s heterogeneity vanishes because effect_scale currently controls both the average treatment effect and heterogeneity amplitude via a single multiplier on logit-scale slopes. The clustered template is the best base for targeting simulation because its cluster structure preserves directional heterogeneity (38% negative-τ visitors) across all effect scales.

Open design issue: effect_scale conflates main effect and heterogeneity amplitude. A two-knob parameterization (separate main_delta and hte_scale) would allow independent control. The clustered template naturally provides some of this via its cluster structure, but the logit-additive templates (reversal, monotone_gradient, interaction_only) need the fix.

Scenario catalog next step: extend MixtureConfig to support richer cluster structures for realistic targeting simulation:

  • 4-6 clusters with distinct treatment response profiles (benefit, neutral, harmed — not just 2 groups)

  • Per-cluster effect vectors (different conversion and severity responses)

  • Separate main_delta (average treatment effect) and hte_scale (heterogeneity amplitude) knobs

  • Effect magnitudes calibrated in probability space, not logit scale

The existing interaction_only and nonlinear templates already cover multi-feature effect surfaces; the gap is in group structure complexity and independent effect parameterization, not interaction patterns.

Flywheel convergence (clustered, es=0.20, baseline_rate=0.03)

With fixed 10k visitors/round, depth 2:

  • Round 1: uniform allocation, GATE RMSE 0.53, oracle gap 0.14

  • Round 3: policy captures ~75% of oracle gap, GATE RMSE 0.09

  • Rounds 4-5: stabilize, GATE RMSE ~0.05

The flywheel works: estimation improves → policy improves → data becomes more informative. But depth-2 can’t fully close the oracle gap (treats ~90% when oracle says ~61% benefit) because 4 segments can’t separate the underlying cluster structure finely enough.

Visitor scheduling

The batched-bandit literature (Perchet et al. 2016, arXiv 1505.00369; Esfandiari et al. AAAI 2021; Che & Namkoong 2023, arXiv 2303.11582) establishes that O(log log T) batches with geometric/doubling batch sizes suffice to match a fully sequential bandit’s regret, i.e., very few rounds are theoretically necessary. In e-commerce, doubling visitors means doubling round duration, which is feasible for optimization programs spanning months.

Geometric scheduling (round k gets n_base × 2^k visitors) is first-class at the public surface (GeometricSchedule). Its theoretical advantage requires the base batch to be large enough for the BCF to produce informative posteriors. At 3% baseline conversion, n_base < 5000 yields too few events (~30 treated conversions) for reliable CATE estimation.
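The doubling rule is simple enough to sketch directly. geometric_schedule below is a hypothetical helper, not the GeometricSchedule API, and it assumes k starts at 0 so that the first round gets n_base visitors.

```python
from itertools import accumulate
from typing import List


def geometric_schedule(n_base: int, rounds: int) -> List[int]:
    """Visitors per round under doubling: round k (from 0) gets n_base * 2**k."""
    return [n_base * 2 ** k for k in range(rounds)]


per_round = geometric_schedule(50_000, 3)
cumulative = list(accumulate(per_round))  # what each round's refit sees
print(per_round)
print(cumulative)
```

With n_base = 50,000 and three rounds this reproduces the 50k/100k/200k rounds and the 50k, 150k, and 350k cumulative fits used in the Performance at Scale benchmark.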

Adaptive tree depth (open question)

Deeper policy trees can capture finer heterogeneity but need more data to avoid overfitting noisy CATE estimates. Sample-size-based depth rules (depth = log2(N / n_min_leaf)) are too crude — they ignore estimation quality and can produce overfit deep trees on noisy early-round CATEs.

The principled approach is cross-validated policy value: for each candidate depth, evaluate held-out expected RPV under that policy. This automatically adapts to signal strength rather than just data volume. Not yet implemented.
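Since this is not yet implemented, the following is only a sketch of the evaluation step. It assumes the held-out visitors come from 50/50 randomization, so each candidate policy's value can be estimated by inverse-probability weighting the revenue of visitors whose assignment happened to match the policy. The candidate policies are hand-written stand-ins for trees fit at each depth on a training fold, and all names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Row = Tuple[int, int, float]  # (feature x, treated 0/1, observed revenue)


def ipw_policy_value(holdout: Sequence[Row], policy: Callable[[int], int],
                     p_treat: float = 0.5) -> float:
    """IPW estimate of mean revenue per visitor if `policy` were deployed."""
    total = 0.0
    for x, treated, revenue in holdout:
        if policy(x) == treated:  # only visitors whose assignment matches the policy count
            total += revenue / (p_treat if treated else 1.0 - p_treat)
    return total / len(holdout)


def select_depth(holdout: Sequence[Row],
                 policies_by_depth: Dict[int, Callable[[int], int]]) -> int:
    """Pick the candidate depth whose policy has the highest held-out value."""
    return max(policies_by_depth,
               key=lambda depth: ipw_policy_value(holdout, policies_by_depth[depth]))


# Tiny made-up holdout: treatment helps when x == 1 and hurts when x == 0.
holdout = [(1, 1, 4.0), (1, 0, 2.0), (0, 1, 0.0), (0, 0, 2.0)]
policies_by_depth = {
    0: lambda x: 1,                    # depth 0: no split, treat everyone
    1: lambda x: 1 if x == 1 else 0,   # depth 1: one split on x
}
print(select_depth(holdout, policies_by_depth))
```

The selection responds to how much held-out value the extra split buys, which is the property the sample-size rule lacks.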