Overview

Pytyche runs smarter, calibrated multi-round A/B tests. Vanilla tools give you one number — “treatment lifts revenue 5%.” Pytyche tells you which customer segments drive the lift and which don’t respond, discovers those segments from your data (no pre-declared cohorts), and over a series of rounds proposes redirecting more traffic toward responders while keeping controls everywhere so measurement stays honest.

Built for online experimentation at typical web scale: 50k to 2M visitors per round. Designed around hurdle-distributed outcomes such as revenue and conversion-then-spend, where most users don’t convert and the ones who do spend heavy-tailed amounts. Posteriors are recalibrated against simulation-based ground truth, so “we’re 90% confident” is actually a 90% claim. Empirically, without recalibration, vanilla BCF coverage at 90% nominal degrades from 0.91 at 30k to 0.60 at 1.2M+; the SBC machinery closes that gap.
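
To make “hurdle-distributed” concrete, here is a minimal stdlib-only sketch of such an outcome: a Bernoulli conversion gate followed by a log-normal spend. The function names and every parameter value (3.0% vs 3.3% conversion, the log-normal location and scale, 200k visitors per arm) are illustrative assumptions, not pytyche’s API or defaults.

```python
import math
import random


def simulate_hurdle(n, p_convert, mu_log, sigma_log, rng):
    """Draw n hurdle outcomes: 0.0 with prob 1 - p_convert, else a log-normal spend."""
    outcomes = []
    for _ in range(n):
        if rng.random() < p_convert:
            outcomes.append(rng.lognormvariate(mu_log, sigma_log))
        else:
            outcomes.append(0.0)
    return outcomes


def hurdle_mean(p_convert, mu_log, sigma_log):
    """Closed form: E[Y] = P(convert) * E[spend | convert] for log-normal spend."""
    return p_convert * math.exp(mu_log + 0.5 * sigma_log ** 2)


rng = random.Random(0)
# Hypothetical arms: the treatment only moves the conversion probability.
control = simulate_hurdle(200_000, 0.030, 3.5, 1.0, rng)
treated = simulate_hurdle(200_000, 0.033, 3.5, 1.0, rng)

share_zero = sum(1 for y in control if y == 0.0) / len(control)
sim_lift = sum(treated) / len(treated) - sum(control) / len(control)
true_lift = hurdle_mean(0.033, 3.5, 1.0) - hurdle_mean(0.030, 3.5, 1.0)

print(f"share of zero outcomes in control: {share_zero:.3f}")
print(f"simulated lift per visitor:        {sim_lift:.3f}")
print(f"closed-form lift per visitor:      {true_lift:.3f}")
```

Even at 200k visitors per arm, the simulated lift can sit noticeably away from the closed-form value, because a few large spends dominate the mean. That is the regime the paragraph above describes.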

Speed isn’t the contribution — speed unlocks the contribution. GPU joint hurdle BCF (5×–60× faster than the StochTree CPU backend at production scale) is what makes per-deployment calibration sweeps cheap enough to run routinely. Calibration becomes something you do per-deployment instead of per-publication; with calibrated posteriors, sequential targeting becomes safe enough to actually deploy.

The loop

The operator surface for the loop is pt.sequential_experiment(...), a stateful experiment you iterate round by round (walked end-to-end in the first adaptive experiment tutorial). Each round, the loop takes the observed round data, plus ground truth when a simulated source supplies it (built-in templates, your own DGP, anything that returns both halves). The loop is the same in both cases: the ground-truth half is used only when you ask for truth-comparison diagnostics (CATE RMSE vs truth, oracle policy quality, SBC sweeps for calibrating the posteriors).

The mechanics, round by round:

  1. Fit joint hurdle BCF on all accumulated data so far. Two coupled forests share tree topology — a probit-conversion channel and a log-severity channel — fit jointly so the structural prior carries information across channels. This stabilizes per-segment CATE estimates at low conversion rates, the regime online experiments live in.

  2. Discover responder segments by fitting a shallow policy tree on the round’s CATE estimates. Segments are leaves of a depth-3 decision tree — discovered from data, not pre-specified. The policy tree doubles as a stakeholder-facing interpretability surface: an auditable handle on the model’s segment-level decisions.

  3. Recalibrate the posterior credible intervals using an SBC-fitted calibration artifact when one is attached. This is the load-bearing honesty step — without it, BART’s regularization prior produces credible intervals that are systematically too narrow at scale. Runs without an artifact carry an explicit uncalibrated label.

  4. Allocate next round as a set of weighted cells: a Control cell (baseline) and an Explore cell (uniform-random) that never fall below their configured weight floors, plus an Optimized cell that routes visitors through the policy tree by Thompson sampling from the posterior. The floors are the structural mechanism that lets you detect when you were wrong: every round keeps baseline measurement and sampling of every treatment alive as a drift-detection surface. If a segment’s response shifts (population change, seasonality, upstream UX change), the retained control traffic is what surfaces it; without it, a confidently wrong allocation is undetectable until something breaks downstream.
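
Step 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names (cell_weights, thompson_pick), the 10% floors, the single “promo” treatment, and the representation of a segment’s posterior as index-aligned lists of draws are all hypothetical, not pytyche’s API.

```python
import random


def cell_weights(requested, control_floor=0.10, explore_floor=0.10):
    """Raise Control and Explore to their floors; Optimized takes the remainder."""
    control = max(requested["control"], control_floor)
    explore = max(requested["explore"], explore_floor)
    optimized = 1.0 - control - explore
    if optimized < 0.0:
        raise ValueError("floors leave no room for the Optimized cell")
    return {"control": control, "explore": explore, "optimized": optimized}


def thompson_pick(posterior_draws, rng):
    """Choose a treatment for one visitor in a segment via Thompson sampling.

    posterior_draws maps treatment name -> list of posterior effect draws,
    index-aligned across treatments. Baseline has an implicit effect of 0.
    One joint draw is sampled, then the best arm under that draw is chosen.
    """
    n_draws = len(next(iter(posterior_draws.values())))
    i = rng.randrange(n_draws)
    best, best_effect = "baseline", 0.0
    for treatment, draws in posterior_draws.items():
        if draws[i] > best_effect:
            best, best_effect = treatment, draws[i]
    return best


rng = random.Random(1)
# Hypothetical posteriors for two segments (leaves of the policy tree).
responder = {"promo": [rng.gauss(0.5, 0.1) for _ in range(500)]}
non_responder = {"promo": [rng.gauss(-0.2, 0.1) for _ in range(500)]}

# The requested split under-weights the controls; the floors override it.
weights = cell_weights({"control": 0.02, "explore": 0.05})
responder_picks = [thompson_pick(responder, rng) for _ in range(2000)]
non_responder_picks = [thompson_pick(non_responder, rng) for _ in range(2000)]

print(weights)
print("responder promo share:    ", responder_picks.count("promo") / 2000)
print("non-responder promo share:", non_responder_picks.count("promo") / 2000)
```

In this sketch the non-responder segment still receives the treatment occasionally inside the Optimized cell, in proportion to the posterior mass above zero. The Control and Explore floors sit on top of that and do not depend on the posterior at all.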

Across rounds the system narrows allocation toward responder segments while never sacrificing the measurement infrastructure that lets you detect when you were wrong. That trade — exploit and keep re-checking — is the whole reason the loop exists.

What you get that vanilla A/B testing doesn’t

  • Per-segment treatment-effect estimates, not just an overall lift number — and segments are discovered, not declared.

  • Five segments tend to recover nearly all of the per-visitor personalization value at much lower complexity (Zhang & Misra 2022, “Coarse Personalization,” in a food-delivery promo RCT). Pytyche operationalizes this: segments are the unit of allocation, not per-visitor propensities.

  • Per-visitor propensities are explicitly rejected as the allocation unit for four reasons: (1) instability, since small estimation changes produce large swings; (2) operational opacity, since a 50k-row propensity table is impossible for a merchandising team to act on; (3) statistical complexity, since inverse-propensity weighting (IPW) with non-uniform per-visitor propensities inflates variance; (4) cross-channel scope, since per-visitor propensities only work for real-time storefront personalization, not for cross-channel decisions like email targeting or pricing changes.

  • Calibrated uncertainty: at production scale, vanilla BART credible intervals undercover (coverage at 90% nominal falls from 0.91 at 30k to 0.60 at 1.2M); pytyche’s SBC machinery recalibrates so the claimed coverage is real.

  • A next-round allocation rule that exploits what you’ve learned without breaking what you’ll need to learn next.

  • Sim-mode dress rehearsals — the same loop runs against the DGP machinery that drives calibration, so you can rehearse a design (rounds, sizes, schedules) against a planted ground truth before spending real traffic, and the rehearsal story and the calibration story are consistent by construction.

  • Honest-uncertainty contracts (pytyche.contracts.ClaimLevel) that prevent analysis code from accidentally peeking at ground truth in simulation contexts, and graduate-to-rollout signals that require sustained evidence across rounds. See statistical honesty for the deeper framing.
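
The calibrated-uncertainty point in the list above can be illustrated with a toy coverage check. This is a deliberately simplified stand-in, not pytyche’s SBC machinery: it simulates replications with known truth, measures how often nominal 90% intervals contain that truth, and fits a single multiplicative widening factor. All names and numbers here (coverage, widen, fit_widening_factor, the factor-of-two overconfidence) are assumptions made for illustration.

```python
import random

Z90 = 1.645  # two-sided 90% quantile of the standard normal


def coverage(intervals, truths):
    """Fraction of (lo, hi) intervals that contain the matching truth."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)


def widen(interval, factor):
    """Scale an interval's half-width by factor, keeping its midpoint."""
    lo, hi = interval
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    return (mid - factor * half, mid + factor * half)


def fit_widening_factor(intervals, truths, nominal=0.90, step=0.05, max_factor=10.0):
    """Smallest factor on a grid whose widened intervals reach nominal coverage."""
    k = 0
    while True:
        factor = 1.0 + k * step
        widened = [widen(iv, factor) for iv in intervals]
        if coverage(widened, truths) >= nominal or factor >= max_factor:
            return factor
        k += 1


rng = random.Random(42)
truths, intervals = [], []
for _ in range(4000):
    truth = rng.gauss(0.0, 1.0)
    estimate = truth + rng.gauss(0.0, 1.0)  # actual estimation error: sd 1.0
    half = Z90 * 0.5                        # model claims sd 0.5: overconfident
    truths.append(truth)
    intervals.append((estimate - half, estimate + half))

# Fit the correction on one half of the replications, check it on the other.
fit_iv, fit_t = intervals[:2000], truths[:2000]
test_iv, test_t = intervals[2000:], truths[2000:]

raw = coverage(test_iv, test_t)
factor = fit_widening_factor(fit_iv, fit_t)
recalibrated = coverage([widen(iv, factor) for iv in test_iv], test_t)

print(f"raw coverage at 90% nominal: {raw:.3f}")
print(f"fitted widening factor:      {factor:.2f}")
print(f"recalibrated coverage:       {recalibrated:.3f}")
```

The correction is only as good as the match between the simulated replications and the real experiment, which is why a correction fitted at one scale should not be carried over blindly to a very different one.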

Who this is for

A growth PM, data scientist, or experimentation engineer who can use Python and pandas but doesn’t want to think about MCMC. Treatments are typically simple in round 1 and get more complex as you learn what’s worth testing further. The expected workflow is: ship a round, hand the observed data to pytyche, read the segment-GATE summary, decide whether to graduate / drop / continue each segment, ship the next round.

What it isn’t

Not a contextual bandit / personalization framework — bandits update per-visitor allocation continuously and treat the per-visitor propensity as the unit of optimization. Pytyche updates per round at segment granularity, matching how real online experiments are run and matching the operational reality of cross-channel decisions. Not a frequentist tool — every output is a posterior, and “we’re 90% confident” means what it says.

Scope and assumptions

Pytyche draws its boundaries explicitly. Out of scope, with no roadmap intent:

  • Cross-visitor interference (SUTVA violations). Pytyche assumes one visitor’s treatment doesn’t affect another’s outcome. Marketplaces — ride-share, food delivery, two-sided rentals — break this; use a marketplace-aware causal stack.

  • Heavily-regulated decision contexts. Pytyche’s honesty machinery is informational, not preregistration- or FDA-grade (see statistical honesty for what it does provide). Clinical trials, regulated finance, and content-moderation rulings need regulator-aware infrastructure.

  • Very-large-catalog recommenders. Pytyche estimates per-treatment effects for a handful of treatments, not per-item effects across thousands of items.

  • Real-time adaptive systems (bidding, trading, feed ranking). Round-based scheduling assumes a pause between rounds; there is no online streaming surface.

  • Heavy-tailed revenue with extreme outliers. Domains where 99th-percentile revenue runs hundreds of times the median may not fit the log-normal severity model; re-check the calibration story for your distribution (calibration at scale).

  • Calibration-scale mismatch. SBC corrections fitted at one scale applied at a very different one may themselves be miscalibrated — match your sweep’s scale to your experiment’s scale.

Observational inference is supported, with caveats; it is not out of scope. BCF is purpose-built for confounded observational settings: it takes propensity scores into the prior and gives strong point estimation (the ACIC-class use case). Two things to know before you rely on it here. First, pytyche expects propensity scores as an input. It has no built-in nuisance/propensity estimation or double-ML cross-fitting; if you need those, reach for econml or DoubleML. Second, the library is shaped and validated around designed experiments (explicit assignment rules, exactly-recorded propensities), so observational use is less tested. As everywhere else, expect strong point estimates, but treat the credible intervals as needing the SBC recalibration step at your scale before you lean on them.

One known limitation, and a plausible future direction, is drift (non-stationarity). Fitting on cumulative data assumes a stationary response surface, and pytyche has no change-point detection today. In the meantime, the controls-retention floors are what surface a shifted segment.