---
title: "Sequential Targeting via Segment-Based Enrichment"
review-state: drafting
last-human-review: "2026-06-09"
depends-on:
  - src/pytyche/experiment
  - src/pytyche/analysis
owner: unowned
quadrant: concept
---

# Sequential Targeting via Segment-Based Enrichment

Design rationale for pytyche's sequential surface: why the HTE estimation → segment discovery → targeted allocation → re-estimation flywheel compounds value over batched e-commerce experiments. The operator surface for this loop is `pt.sequential_experiment(...)` — the [first adaptive experiment](../tutorials/first-adaptive-experiment.md) tutorial walks it end-to-end; this doc explains why it's shaped the way it is.

## Problem Statement

We have a validated HTE estimation pipeline: joint hurdle BCF recovers per-visitor CATEs decomposed into conversion and AOV channels. The 8-scenario benchmark (March 2026, n=50k) shows Joint BCF wins GATE RMSE 7/8 scenarios — meaning it correctly identifies *which groups* benefit vs. are harmed by treatment.

The missing proof: **does acting on those estimates actually help?** If we target treatment to estimated benefiters and withhold from estimated non-benefiters, does realized lift exceed uniform allocation? Does the improvement compound over sequential experiments as estimates sharpen?

Nobody has demonstrated this loop empirically for **batched** e-commerce experiments on **zero-inflated revenue**. The closest work operates in different settings (online bandits, single-shot policy evaluation, continuous outcomes).

## Literature Positioning

### What exists

**Contextual bandits with HTE oracles (Carranza, Krishnamurthy, Athey):** Connects contextual-bandit regret to HTE estimation quality — showing CATE oracles are more sample-efficient than full reward modeling. But: online/single-visitor updates, not batched experiments. Continuous outcomes, not zero-inflated.

**Batched bandit theory (Perchet et al. 2016, arXiv 1505.00369; Esfandiari et al.
AAAI 2021; Che & Namkoong 2023, arXiv 2303.11582):** Establishes that **O(log log T) batches suffice** to match the regret of a fully sequential bandit, with geometric/doubling batch sizes. Che & Namkoong 2023 specifically targets adaptive experimentation at scale with flexible batches + delayed feedback. The theoretical anchor for pytyche's geometric visitor scheduling.

**R-learner / causal forests (Nie & Wager 2021, Wager & Athey 2018):** Foundation for CATE estimation with theoretical guarantees (asymptotic normality, honesty). The R-learner's residual-on-residual approach based on Frisch-Waugh-Lovell is elegant. But: single-experiment estimators, no sequential targeting loop. Continuous outcomes only.

**Coarse personalization via segment compression (Zhang & Misra 2022, "Coarse Personalization," arXiv 2204.05793, EC '24):** A food-delivery promotions field experiment found that **5 discrete segments recover 99.5% of full personalization value** — the policy tree is a near-lossless compression of the CATE surface. But: single experiment, off-the-shelf methods, continuous outcomes, no sequential loop, no hurdle decomposition. Domain-specific result; counter-evidence in Shchetkina 2024 (arXiv 2411.16552, withdrawn) reports 18% vs 4% personalization gains depending on population conditions, suggesting the 99.5% transfers cleanly to promo-like settings but isn't universal.

**Single-shot policy evaluation methodology (Hitsch, Misra & Zhang 2024, QME 22:115-168):** Canonical marketing-science framing for off-policy evaluation of arbitrary targeting policies from a single RCT. Related to the above but addresses *how* to evaluate policy trees, not the empirical 99.5% finding.

**Adaptive enrichment (clinical-trial designs):** Adaptive-enrichment trial designs progressively narrow trial populations to responsive subgroups. Closest to our loop conceptually: estimate → select → re-estimate.
But: very different statistical setting (survival/binary outcomes, small samples, phase II/III drugs).

### The gap

| Setting | Prior work | Our contribution |
|---|---|---|
| Single-shot HTE + targeting | Zhang & Misra 2022; Hitsch/Misra/Zhang 2024 | — |
| Online sequential targeting | Carranza/Krishnamurthy/Athey | — |
| Batched sequential bandit theory | Perchet 2016 / Esfandiari 2021 / Che & Namkoong 2023 | — |
| Batched sequential HTE targeting | *Gap* | This work |
| Zero-inflated revenue outcomes | *Gap* | This work |
| Segment-based enrichment for e-commerce | *Gap* | This work |

The contribution is compositional: batched sequential targeting + hurdle BCF + segment-based enrichment + e-commerce application. No individual component is novel, but the combination and its empirical demonstration on zero-inflated revenue have not been done.

## Approach: Segment-Based Enrichment

### Why not per-visitor propensities?

The straightforward approach is: estimate τ̂(x) per visitor → set treatment probability proportional to estimated benefit. This is theoretically optimal (maximizes expected welfare) but practically problematic:

1. **Instability**: individual CATE estimates are noisy; small changes in estimation produce large swings in per-visitor propensities across rounds, making the system erratic.
2. **Operational opacity**: a 50k-row propensity table is impossible for a merchandising team to act on. You can't brief stakeholders on "visitor #34,721 gets p=0.73."
3. **Statistical complexity**: non-uniform propensities require inverse probability weighting for valid estimation, introducing variance inflation and coverage issues.
4. **Scope limitation**: per-visitor propensities only work for real-time storefront personalization. They can't drive cross-vertical business decisions ("should we also change email targeting for this segment?").

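
The variance cost named in point 3 can be made concrete with a small sketch. It assumes a plain difference-in-means estimator with equal outcome variance in both arms, in which case the sampling variance is proportional to 1/p + 1/(1 − p) at treatment propensity p. The function names are illustrative and not part of pytyche.

```python
def variance_factor(p: float) -> float:
    """Relative variance of a difference-in-means estimate at propensity p.

    Assumes equal outcome variance in both arms, so the variance is
    proportional to 1/p + 1/(1 - p), which is smallest at p = 0.5.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("propensity must lie strictly between 0 and 1")
    return 1.0 / p + 1.0 / (1.0 - p)


def inflation_vs_uniform(p: float) -> float:
    """Factor by which sample size must grow to match 50/50 precision."""
    return variance_factor(p) / variance_factor(0.5)


for p in (0.5, 0.7, 0.9, 0.95):
    print(f"p={p:.2f}  inflation={inflation_vs_uniform(p):.2f}x")
```

Under these assumptions a propensity of 0.9 needs roughly 2.8 times the visitors of a 50/50 split for the same precision, which is one reason the design below keeps assignment uniform inside active segments.
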
### Segment-based enrichment model

Instead, we fit a **shallow decision tree** (depth 2-3) on the CATE estimates to produce **human-interpretable segments**. Each segment gets a binary decision: treat or control.

```
BCF ensemble (50 tau trees × 100+ MCMC draws)
    │
    │  posterior mean τ̂(x) per visitor
    ▼
Policy tree (single DecisionTreeRegressor, depth 2)
    │
    │  4-8 segments with feature-based rules
    ▼
Segment allocation:
    "benefit" segments → 50/50 uniform (active experimentation)
    "harm" segments    → 100% control (with ε exploration floor)
```

**Within active segments, assignment is uniform 50/50.** This preserves maximum statistical power per observation and avoids propensity correction. The "targeting" is in segment *selection*, not assignment *biasing*.

This is essentially **adaptive enrichment** — the clinical-trial design pattern adapted to e-commerce.

### The information bottleneck

Individual BCF tau trees are mostly **stumps** (α_tau=0.25, β_tau=3.0 means only 25% chance of splitting at root, 3% at depth 1). But the ensemble of 50 trees × 100+ MCMC draws captures a complex CATE surface through additive composition.

The policy tree is a deliberate compression layer: "given everything BCF learned, what are the 4-8 most important groups?" This is the coarse-personalization insight from Zhang & Misra 2022 quantified — 5 segments recover 99.5% of full personalization value in their food-delivery promo experiment. The remaining 0.5% buys enormous operational simplicity. Domain-specific result; transfer to other settings (especially zero-inflated) hasn't been directly studied.

### Why this matters for practice

The segment framing is how real experimentation teams operate:

1. **Stability over time**: coarse groups (e.g., "mobile users with >3 prior purchases") are robust to population drift. Per-visitor propensities shift with every new visitor.
2. **Cross-vertical actionability**: segments can drive decisions beyond the storefront — email campaigns, pricing changes, inventory allocation. "High-engagement desktop users benefit from our premium layout" is actionable across channels.
3. **Progressive confidence**: early rounds use heavy control allocation (discovery). As segment effects stabilize, control fraction decreases (monitoring). The transition from discovery to exploitation is a configurable schedule, not a binary switch.

## Core Loop

The per-round estimator is the **multi-arm joint hurdle BCF** ({doc}`multi-arm-hurdle-bcf`): a single shared prognostic forest plus a tau forest whose leaves carry a `(K−1)` contrast vector, yielding a joint posterior over all treatment-vs-control contrasts per visitor. The sequential engine calls the same public analysis primitives that power users call directly — there is no private variant. For the type hierarchy see {doc}`result-objects`; for a hands-on walkthrough of each primitive see [Working with the posterior](../tutorials/working-with-the-posterior.md).

```
Round 1: uniform allocation across all arms
  → pt.fit(observed, pooling="joint")
      returns HurdleBCFResult (per-visitor rpv_cate_samples)
  → posterior.fit_policy_tree(...)
      fits policy tree on cumulative per-visitor CATEs
  → initial segment-to-arm mapping

Round r>1: classify new visitors into segments from round r-1
  → each segment → assigned to its best arm (or control)
  → "uncertain" segments: exploration allocation
  → observe outcomes
  → pt.fit(accumulated_observed, pooling="joint")
      fits on ALL accumulated data → updated HurdleBCFResult
  → posterior.fit_policy_tree(...)
      refit policy tree → updated segment-to-arm mapping

Final: converged segments with stable per-arm treatment rules
```

Per-round primitives in order:

1. **Estimator fit** — `pt.fit(observed, pooling="joint")` (equivalently `fit_hurdle_bcf(observed, pooling="joint")`) takes an `ObservedExperimentData` and returns a `HurdleBCFResult`.
   The engine accumulates all data across rounds before each fit; the result carries `rpv_cate_samples: (n, S_total, K−1)` — one contrast column per treatment vs control.

2. **Segmentation** — `posterior.fit_policy_tree(max_depth=..., min_segment_share=...)` fits a policy tree on the per-visitor posterior-mean CATE vectors, returning a `PolicyTreeResult` with `segments: list[DiscoveredSegment]`. Each segment carries `stability_score` (bootstrap replicability), `gate_estimate`, and `arm_best_probabilities` keyed by all variant names including control.

3. **Allocation** — `posterior.thompson_allocation(segments, epsilon=...)`. Per-leaf allocation is computed from the fraction of posterior draws in which each arm is best under the **shared best-arm rule**: `best_arm(δ) = 0` (control) if `max_j δ_j ≤ 0`, else `argmax_j δ_j + 1`. Control is a first-class winner — it takes a draw exactly when every contrast is non-positive.

   The `epsilon` kwarg is the internal Thompson safety-net floor (`ε/K` per active treatment). It is NOT the operator-facing controls-retention dial; controls retention at the L1 surface is `min_control_weight` / `min_explore_weight` on `pt.sequential_experiment(...)`.

4. **Calibration application** — `posterior.apply_calibration(calibration)` returns a new posterior of the same type with the R(p) + scale-family correction applied. K=2 only in v0.2; K≥3 per-contrast calibration raises `NotImplementedError` until the per-contrast SBC machinery ships with the sequential-surface calibration work.

5. **Ship / stop / continue** — `posterior.recommendation_summary(treatment, segment)` returns a `RecommendationSummary` with `decision`, `expected_loss_comparison`, `probability_positive`, `probability_better`, `probability_harmful`, and `expected_value_of_one_more_round`.

   `expected_value_of_one_more_round` is the information-theoretic value of running one additional round at the same per-round n.
   Near-zero means the experiment has converged and additional data is unlikely to change the decision.

### Key design decisions

**Data accumulation**: BCF is refit on all data across rounds, not just the latest round. This gives maximum sample size for estimation but requires tracking per-visitor propensities (which vary by round and segment membership). Within active segments, propensity = 0.5. Within dropped segments, propensity = ε/2.

**Assignment through the generator's own hook**: `generate_v2_core` decomposes into `sample_features → compute_potential_outcomes → assign_and_observe → build_bundle`. The sim-mode adapter computes policy-routed assignments from the round plan's cells and drives `assign_and_observe` through its external `treatment_assignment` hook — one generation path, with the realized per-visitor assignment propensities recorded alongside the data.

**Epsilon exploration floor**: "dropped" segments still get a small ε fraction treated. This allows detecting if the treatment effect changed (e.g., the population shifted, or a product change made the treatment beneficial for a previously-harmed group). ε=0.10 as default, sweepable. This is the ε passed to `posterior.thompson_allocation(epsilon=...)` — the within-Thompson safety net. The cell-level controls-retention floor (the L1 operator dial) is separate.

### Segment discovery vs. bandit optimization

This system is **not a contextual bandit**. The goal is not "optimal per-visitor allocation to maximize cumulative reward." The goal is: discover stable, interpretable population segments with consistent treatment responses, then evolve the policy tree that routes them.


| | Contextual bandit | Segment discovery |
|---|---|---|
| Unit | individual visitor | segment (country × device × engagement) |
| Output | per-visitor propensity (opaque) | policy tree rules (interpretable) |
| Converges to | optimal allocation function | stable segment definitions |
| Actionable by | real-time personalization engine | merchandising team, cross-channel |

The policy tree's job is to produce rules like "high-engagement mobile users in DE respond to premium layout" — actionable across storefront, email, pricing. Per-visitor propensities can't drive these decisions. This distinction determines the allocation basis (see below).

### Allocation basis: segment-mean vs. individual

The allocation probability P(τ>0) can be computed two ways:

- **Segment-mean**: P(segment-mean CATE > 0) across posterior draws. "Is treating this segment *as a group* net-positive?"
- **Individual**: fraction of individual posterior samples > 0. "What share of *visitors* in this segment have positive CATE?"

For segment discovery, segment-mean is correct. A segment where 75% of visitors benefit (individual P(τ>0) ≈ 0.75) but the segment-mean CATE is clearly positive (segment-mean P(τ>0) ≈ 0.95) should be treated decisively — the within-segment heterogeneity is resolved by *refining the tree* (splitting further), not by setting allocation to 0.75.

The individual-fraction approach produces the "knows but won't commit" problem: segments with large true effects and narrow CIs still get mushy p_treat values (0.6-0.8) because the segment is internally mixed. Segment-mean pushes these to near 0 or 1, matching the binary treat/hold decision the system is designed to make.

The individual-fraction approach was retained as a research baseline during validation; segment-mean is what ships.

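
The two allocation bases can be contrasted in a minimal NumPy sketch. It assumes a single-contrast posterior array of shape `(n_visitors, n_draws)` for one segment (a K=2 slice of something like `rpv_cate_samples`); the function names and the synthetic numbers are illustrative, and the noise is drawn independently per visitor, which real posterior draws need not satisfy.

```python
import numpy as np


def segment_mean_prob_positive(cate_samples: np.ndarray) -> float:
    """P(segment-mean CATE > 0): average over visitors per draw, then count draws."""
    per_draw_mean = cate_samples.mean(axis=0)  # shape (n_draws,)
    return float((per_draw_mean > 0).mean())


def individual_fraction_positive(cate_samples: np.ndarray) -> float:
    """Share of all (visitor, draw) samples with a positive CATE."""
    return float((cate_samples > 0).mean())


# Synthetic mixed segment (illustrative values): 75 visitors with effect +1.0,
# 25 visitors with effect -0.5, unit posterior noise on every sample.
rng = np.random.default_rng(0)
true_tau = np.concatenate([np.full(75, 1.0), np.full(25, -0.5)])
samples = true_tau[:, None] + rng.normal(0.0, 1.0, size=(100, 400))

p_segment = segment_mean_prob_positive(samples)
p_individual = individual_fraction_positive(samples)
print(f"segment-mean P(tau>0) = {p_segment:.2f}")
print(f"individual fraction   = {p_individual:.2f}")
```

In this toy setup the segment-mean probability sits near 1 while the individual fraction stays in the mushy middle range, which is the "knows but won't commit" pattern described above.
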
### The cell model

At the operator surface this three-way split is first-class: each round ships a list of `Cell` objects (weight + assignment `Policy`):

- **Control cell** (`BaselinePolicy`): no treatment — clean counterfactual baseline. Never falls below `min_control_weight`.
- **Explore cell** (`UniformPolicy`): randomized across treatments — clean RCT signal for HTE discovery. Never falls below `min_explore_weight`.
- **Optimized cell** (`TreePolicy`): policy-tree-routed treatment — value capture from learned segments.

The Optimized cell has full expressive power: different segments get different treatments (not just treat/hold). Inside it, the Thompson `epsilon` controls how close per-segment allocation can get to 0 or 1 for any arm — higher epsilon means propensities closer to uniform-K. The cell weights are the operator-facing dial; epsilon is the within-cell safety net.

Operators can add their own hypothesis cells alongside these three (see the [injecting your own treatment hypotheses](../tutorials/injecting-your-own-treatment-hypotheses.md) tutorial).

## Metrics

| Metric | Source | What it answers |
|---|---|---|
| Expected lift | True p0/p1/m0/m1 | "How good is the policy, ignoring noise?" |
| Realized lift | Observed revenue | "What does a practitioner actually see?" |
| Oracle lift | Treat iff true_tau > 0 | "Upper bound with perfect information" |
| Uniform lift | ATE from 50/50 | "Baseline: what if we didn't target at all?" |
| Targeting regret | Oracle − realized | "How far from optimal per round?" |
| Cumulative regret | Sum across rounds | "Does learning converge?" (should flatten) |
| CATE RMSE | Per-visitor estimates | "How accurate are individual CATEs?" |
| GATE RMSE | Per-quartile group averages | "How accurate are group-level effects?" |
| Segment stability | Jaccard similarity | "Are segment boundaries converging?" |
| Segment count | Active segments per round | "How complex is the targeting rule?" |
| Policy accuracy | Fraction matching oracle | "What % of visitors get correct assignment?" |

**Expected vs. realized lift**: Expected lift isolates policy quality by using true potential outcomes. Realized lift includes sampling noise. Both should increase over rounds, with realized lift converging toward expected.

**GATE RMSE is the leading indicator for targeting quality**: if the estimator correctly identifies which *segments* benefit vs. are harmed, the policy tree will make good decisions regardless of individual CATE noise. This is why Joint BCF's GATE RMSE superiority (7/8 scenarios in benchmark) matters more than Proto's individual ranking advantage.

## Multi-Arm Targeting

The multi-arm joint hurdle BCF ({doc}`multi-arm-hurdle-bcf`) fits a single prognostic forest shared across all arms and a tau forest whose leaves carry a `(K−1)` contrast vector — one contrast per treatment vs. control. The joint posterior over contrasts is what makes the policy tree's per-arm assignment calibrated: the best-arm rule is applied within each correlated posterior draw, so winner's-curse bias is eliminated.

This matches real-world experimentation practice: "free shipping" vs. "10% discount" vs. "bundle offer" — different visitor segments may respond to different promotions. The policy tree produces rules like:

```
IF recency > 30 days:
    → low_promo ("show discount")
ELSE IF order_count > 5:
    → free_ship ("free shipping offer")
ELSE:
    → control (more data needed)
```

The per-round composition using the public L2 primitives:

```python
# Round 1: uniform allocation across arms
# (control, low_promo, free_ship, ...)
posterior = pt.fit(observed, pooling="joint")
# HurdleBCFResult; rpv_cate_samples: (n, S_total, K-1)

tree = posterior.fit_policy_tree(max_depth=3)
# PolicyTreeResult; tree.segments is the segment-to-arm mapping

allocation = posterior.thompson_allocation(segments=tree.segments)
# dict[leaf_id, dict[treatment_name, weight]]
# best_arm(δ) = 0 if max_j δ_j ≤ 0 else argmax_j δ_j + 1

# Round r: each segment → matched to its best arm
# control fraction decays: 50% → 30% → 10%
```

**Progressive control decay**: early rounds keep heavy control allocation for discovery. As estimates stabilize (measured by segment stability and GATE RMSE convergence), control fraction decreases. Control shifts from "discovery" to "monitoring" — a configurable schedule rather than a binary switch.

The policy tree labels at K≥3 are computed by applying the shared best-arm rule to each visitor's posterior-mean contrast vector; the sklearn tree is then a multiclass classifier routing future visitors to arms. `arm_best_probabilities` on each `DiscoveredSegment` carries the per-draw frequencies that drove the Thompson allocation — useful for spotting ambivalent segments (near-tied leaders) vs. decisive ones. See [Working with the posterior §2–3](../tutorials/working-with-the-posterior.md) for a worked example of both.

## Scenario Priority

From the 8 validated templates, these best showcase sequential targeting:

1. **reversal** — CATE flips sign across visitor features. 50/50 benefiters vs. harmed. Perfect targeting = 2x uniform lift. Best showcase: the policy tree must correctly separate groups.
2. **sparse_benefit** — only ~15% of visitors benefit. Targeting concentrates resources on the minority. Tests selectivity.
3. **monotone_gradient** — smooth CATE gradient across feature space. Tests how policies sharpen segment boundaries over rounds.
4. **constant** — homogeneous treatment effect. Targeting should NOT help. Sanity check: cumulative regret ≈ 0, no spurious segment discovery.

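
Why the reversal and constant templates sit at opposite ends of this list can be seen in a toy calculation. The sketch below assumes lift is the mean per-visitor gain over the whole population and uses expected (true-effect) lift in place of realized lift; the effect magnitudes are hypothetical values chosen so that oracle lift is twice the treat-everyone lift, not the actual template parameters, and the function names are illustrative.

```python
import numpy as np


def expected_lift(true_tau: np.ndarray, treat: np.ndarray) -> float:
    """Mean per-visitor gain when exactly the visitors flagged in `treat` are treated."""
    return float(np.mean(true_tau * treat))


def oracle_lift(true_tau: np.ndarray) -> float:
    """Lift of the oracle policy: treat iff true_tau > 0."""
    return expected_lift(true_tau, true_tau > 0)


def targeting_regret(true_tau: np.ndarray, treat: np.ndarray) -> float:
    """Oracle lift minus the policy's expected lift."""
    return oracle_lift(true_tau) - expected_lift(true_tau, treat)


# Toy effect vectors (hypothetical magnitudes, not the template parameters).
constant = np.full(100, 0.2)
reversal = np.concatenate([np.full(50, 1.0), np.full(50, -0.5)])
treat_all = np.ones(100, dtype=bool)

for name, tau in [("constant", constant), ("reversal", reversal)]:
    print(name, oracle_lift(tau), targeting_regret(tau, treat_all))
```

With a homogeneous positive effect, treating everyone already matches the oracle, so any segment the loop "discovers" is spurious; with a sign flip, untargeted treatment leaves half of the oracle lift on the table in this toy setup.
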
## Benchmark Foundation

Results from the 8-scenario calibration run (March 2026, n=50k, Joint hurdle BCF vs. Proto vs. ST-independent):

| Scenario | Joint GATE RMSE | Proto GATE RMSE | Winner |
|---|---|---|---|
| constant | 0.03 | 0.10 | Joint |
| reversal | 0.32 | 0.52 | Joint |
| sparse_benefit | 0.13 | 0.21 | Joint |
| monotone_gradient | 0.04 | 0.47 | Joint |
| nonlinear_interaction | 0.09 | 0.06 | Proto |
| partial_null | 0.09 | 0.22 | Joint |
| high_noise | 0.20 | 0.44 | Joint |
| clustered | 0.36 | 0.37 | Joint |

Joint BCF wins GATE RMSE 7/8 scenarios. ST-independent retired (loses 0/8). Proto retains advantage in per-visitor ranking for some scenarios. For sequential targeting, GATE RMSE is the relevant metric — it measures how accurately the estimator identifies group-level effects, which directly determines policy tree quality.

## Performance at Scale

The [canonical adaptive tutorial](../tutorials/first-adaptive-experiment.md) doubles as the reference benchmark: three rounds of 50k/100k/200k visitors — cumulative fits at 50k, 150k, and 350k — with K = 3 arms at the library-default fit configuration, on a single workstation GPU (Quadro RTX 5000, 16 GB). Measured across repeated runs (GPU fits are not bit-reproducible, so treat these as bands, not pins):

| Cumulative n | Round wall-clock | Peak GPU memory |
|---|---|---|
| 50k | ~2 min | ~0.7 GiB |
| 150k | ~4–4.5 min | ~1.0 GiB |
| 350k | ~8–10 min | ~2.3 GiB |

The whole three-round experiment runs end-to-end in roughly 15 minutes. Round wall-clock covers the full per-round pipeline — the BCF fit dominates; segment discovery, recommendation, and truth comparison are seconds.

How that scales:

- **Below n ≈ 50k, wall-clock is flat.** Fixed costs (JIT compilation, chain warmup) dominate, so small fits all cost on the order of a couple of minutes regardless of n.
- **Beyond the flat region, round time grows roughly linearly with cumulative n** — about 1.1–1.6 s per 1,000 visitors at K = 3 defaults on this card. The upper end of that band is not pure data volume: the loop deliberately grows the tau forest as evidence accumulates (the adaptive `num_trees_tau` sizing), so later rounds buy more model capacity along with more data.
- **Arms multiply cost mildly.** At n = 50k, a K = 3 fit costs ~1.4× the K = 2 paired fit (fit-only, same configuration, per-K compilation included).
- **Memory is not the binding constraint at these scales.** 350k cumulative visitors peak near 2.3 GiB — a 16 GB consumer card has headroom for cumulative fits well past a million visitors before memory planning is needed.

## Pre-release Empirical Findings

These findings come from the pre-release research loop that validated the approach (since superseded by `pt.sequential_experiment` as the one public surface). The evidence stands; the tooling it names is historical.

### Effect scale regime

`effect_scale` controls CATE magnitude in the scenario templates. At `es=1.0` (the stress-test regime designed for BCF benchmarking), the signal-to-noise ratio is unrealistically favorable — round 1→2 captures nearly all oracle lift with no room for the flywheel to demonstrate progressive improvement.

At realistic e-commerce scales (`es=0.10–0.20`, `baseline_rate=0.03`), the reversal template's heterogeneity vanishes because `effect_scale` currently controls both the average treatment effect and heterogeneity amplitude via a single multiplier on logit-scale slopes. **The clustered template is the best base for targeting simulation** because its cluster structure preserves directional heterogeneity (38% negative-τ visitors) across all effect scales.

**Open design issue:** `effect_scale` conflates main effect and heterogeneity amplitude. A two-knob parameterization (separate `main_delta` and `hte_scale`) would allow independent control.
The clustered template naturally provides some of this via its cluster structure, but the logit-additive templates (reversal, monotone_gradient, interaction_only) need the fix.

**Scenario catalog next step:** extend `MixtureConfig` to support richer cluster structures for realistic targeting simulation:

- 4-6 clusters with distinct treatment response profiles (benefit, neutral, harmed — not just 2 groups)
- Per-cluster effect vectors (different conversion and severity responses)
- Separate `main_delta` (average treatment effect) and `hte_scale` (heterogeneity amplitude) knobs
- Effect magnitudes calibrated in probability space, not logit scale

The existing `interaction_only` and `nonlinear` templates already cover multi-feature effect surfaces — the gap is in group structure complexity and independent effect parameterization, not interaction patterns.

### Flywheel convergence (clustered, es=0.20, baseline_rate=0.03)

With fixed 10k visitors/round, depth 2:

- **Round 1**: uniform allocation, GATE RMSE 0.53, oracle gap 0.14
- **Round 3**: policy captures ~75% of oracle gap, GATE RMSE 0.09
- **Rounds 4-5**: stabilize, GATE RMSE ~0.05

The flywheel works: estimation improves → policy improves → data becomes more informative. But depth-2 can't fully close the oracle gap (treats ~90% when oracle says ~61% benefit) because 4 segments can't separate the underlying cluster structure finely enough.

### Visitor scheduling

The batched-bandit literature (Perchet et al. 2016, arXiv 1505.00369; Esfandiari et al. AAAI 2021; Che & Namkoong 2023, arXiv 2303.11582) establishes that **O(log log T) batches** with geometric/doubling batch sizes suffice to match a fully sequential bandit's regret — i.e., very few rounds are theoretically necessary. In e-commerce, doubling visitors means doubling round duration — feasible for optimization programs spanning months.

Geometric scheduling (round k gets `n_base × 2^k` visitors) is first-class at the public surface (`GeometricSchedule`).
Its theoretical advantage requires the base batch to be large enough for the BCF to produce informative posteriors. At 3% baseline conversion, `n_base < 5000` yields too few events (~30 treated conversions) for reliable CATE estimation.

### Adaptive tree depth (open question)

Deeper policy trees can capture finer heterogeneity but need more data to avoid overfitting noisy CATE estimates. Sample-size-based depth rules (`depth = log2(N / n_min_leaf)`) are too crude — they ignore estimation quality and can produce overfit deep trees on noisy early-round CATEs.

The principled approach is **cross-validated policy value**: for each candidate depth, evaluate held-out expected RPV under that policy. This automatically adapts to signal strength rather than just data volume. Not yet implemented.

## Related concepts

- {doc}`multi-arm-hurdle-bcf` — the per-round estimator: joint posterior over K−1 treatment-vs-control contrasts, and how result shapes map to per-segment arm probabilities
- {doc}`result-objects` — the type hierarchy: `HurdleBCFResult`, `AnalysisResult`, and `Experiment`; the observed-stashing contract that lets analysis primitives reach the input data without it being passed again
- [Working with the posterior](../tutorials/working-with-the-posterior.md) — hands-on walkthrough of each analysis primitive the sequential engine composes: `fit_policy_tree`, `thompson_allocation`, `apply_calibration`, `recommendation_summary`
- {doc}`overview` — what pytyche does and who it's for
- {doc}`bcf-calibration-at-scale` — the benchmark results behind GATE RMSE comparisons cited above