Statistical Honesty

How pytyche keeps an experimentation system from fooling itself — at both the algorithmic and the operator level.

An experimentation stack makes promises: “82% chance variant B is better”, “returning customers respond 3× more strongly”, “expected loss of $0.03/visitor if you ship.” These are powerful claims. If the system produces them carelessly — through algorithmic overfitting or operator cherry-picking — they become dangerous. Bad experiment results are worse than no experiment results, because they carry the authority of data.

This page describes the two distinct honesty problems, the machinery pytyche ships today for each, and — clearly flagged as such — the direction the operator-honesty machinery is headed.

The two gardens

Gelman and Loken (2013) described the “garden of forking paths”: the observation that researcher degrees of freedom create multiple-comparison problems even without conscious p-hacking. In an experimentation platform this garden has two distinct sections:

Garden 1 — the algorithm’s paths. Which features to split on, how deep to grow trees, where to place thresholds. The model searches a space of possible explanations and can overfit its own search: the splits that survive were selected for producing large apparent effects, and some of that apparent effect is an artifact of the selection itself.
Garden 2 — the operator’s paths. Which segments to examine, which metric to lead with, when to declare “enough data”, whether to re-run with different parameters until a satisfying story emerges. Even with perfectly calibrated posteriors, a human selecting among valid outputs can cherry-pick — not as a character flaw, but because that is how pattern-matching minds work.

Most platforms conflate the two or address only the first. They are different problems with different solutions.
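The selection effect in Garden 1 can be seen in a small simulation. The sketch below is an illustration only, not pytyche code: the outcome is pure noise, so the true treatment effect is zero in every segment. The function name, sample sizes, and feature count are all assumptions chosen for the demonstration. It compares the apparent effect in one pre-specified segment with the largest apparent effect found by searching twenty irrelevant features.

```python
import numpy as np

def searched_vs_prespecified(rng, n=2000, n_features=20):
    """Return (|effect| in a pre-specified segment, largest |effect| found by search).

    The true treatment effect is zero everywhere, so both numbers are noise.
    """
    z = rng.integers(0, 2, size=n)                 # random treatment assignment
    y = rng.normal(size=n)                         # outcome unrelated to treatment
    x = rng.integers(0, 2, size=(n, n_features))   # irrelevant binary features
    effects = np.empty(n_features)
    for j in range(n_features):
        segment = x[:, j] == 1
        effects[j] = y[segment & (z == 1)].mean() - y[segment & (z == 0)].mean()
    return abs(effects[0]), np.abs(effects).max()

rng = np.random.default_rng(0)
results = np.array([searched_vs_prespecified(rng) for _ in range(200)])
prespecified_mean, searched_mean = results.mean(axis=0)
print(f"pre-specified segment: {prespecified_mean:.3f}")
print(f"best of 20 searched:   {searched_mean:.3f}")
```

The searched figure is systematically larger even though nothing real is there to find. An operator choosing which of twenty valid segment reports to present is running the same search by hand, which is why Garden 2 needs its own treatment.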

Garden 1: algorithmic honesty

Pytyche’s heterogeneous-treatment-effect estimator is a Bayesian Causal Forest (BCF; Hahn, Murray, and Carvalho 2020). BCF addresses the algorithm’s garden through prior regularization rather than sample splitting. The model separates into two forests:

    Y(x) = μ(x) + τ(x)·Z + ε

    μ(x) — prognostic forest (flexible)
           "What is the baseline outcome for this visitor?"

    τ(x) — treatment-effect forest (deliberately more regularized:
           fewer trees, stronger shrinkage toward zero)
           "How does treatment change the outcome for this visitor?"

The asymmetry encodes the correct prior belief: treatment effects are typically smaller and simpler than baseline variation. Effects that are artifacts of the tree search get shrunk toward zero; effects genuinely supported by data survive the shrinkage. No sample splitting is required, so the full dataset is available for both discovery and estimation. This is the practical reason BCF outperforms honest-splitting forests (Athey and Imbens 2016) in small-segment regimes: a segment within a split may have only hundreds of visitors, and holding part of them out for estimation produces intervals that are valid but too wide to act on. Valid-but-useless intervals are still useless.

Prior regularization alone is not enough, though. At production scale, BART-family posteriors are structurally miscalibrated: the same regularization that prevents overfitting also narrows credible intervals below their nominal coverage. Pytyche’s answer is empirical:

Simulation-based calibration (SBC) sweeps fit the full pipeline against hundreds of synthetic datasets with known ground truth, measuring actual vs. nominal interval coverage across realistic data-generating processes.
The measured miscalibration is fitted as a correction artifact (Calibration), which posterior.apply_calibration(...) attaches to a posterior so interval queries are remapped to honest coverage.
The sequential experiment loop warns when run uncalibrated: a pt.sequential_experiment(...) constructed without a calibration artifact emits UncalibratedWarning on its first round. Uncorrected intervals are allowed, but never silent.
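The idea behind the coverage sweep and the correction can be sketched on a toy model. This is not pytyche's pipeline or its Calibration artifact. As a stand-in for an over-regularized estimator it uses a conjugate normal model whose prior is deliberately too tight, and the names `posterior`, `coverage`, and `requested` are hypothetical. It measures how often nominal intervals contain the known truth, then finds the nominal level to request in order to obtain an honest 90%.

```python
import numpy as np
from statistics import NormalDist

def posterior(y, prior_sd):
    """Normal-normal posterior (mean, sd) for a mean: unit noise, prior centred at 0."""
    precision = len(y) + 1.0 / prior_sd**2
    return y.sum() / precision, precision ** -0.5

def coverage(nominal, fits, truths):
    """Fraction of central credible intervals at `nominal` that contain the truth."""
    half_width = NormalDist().inv_cdf(0.5 + nominal / 2)
    means, sds = fits[:, 0], fits[:, 1]
    return float(np.mean(np.abs(truths - means) <= half_width * sds))

rng = np.random.default_rng(0)
# Synthetic datasets with known truth (sd 1.0); the model assumes a tighter prior (sd 0.5).
truths = rng.normal(0.0, 1.0, size=2000)
fits = np.array([posterior(rng.normal(t, 1.0, size=10), prior_sd=0.5) for t in truths])

# Sweep nominal levels and record the coverage actually achieved at each one.
nominal_levels = np.linspace(0.50, 0.999, 200)
actual = np.array([coverage(q, fits, truths) for q in nominal_levels])

# Remap: the smallest nominal level whose measured coverage reaches 0.90.
reaches = actual >= 0.90
if not reaches.any():
    raise ValueError("no nominal level on the grid reaches 90% coverage")
requested = float(nominal_levels[np.argmax(reaches)])

print(f"coverage of nominal 90% intervals: {coverage(0.90, fits, truths):.2f}")
print(f"nominal level needed for honest 90%: {requested:.3f}")
```

The nominal 90% intervals cover the truth noticeably less often than 90% of the time, and the remap widens them by asking for a higher nominal level. A real correction has to be fitted across many data-generating processes rather than one, which is what the sweeps are for.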

BCF calibration at scale covers why the miscalibration arises and what the sweeps measure.

One more Garden-1 guard is structural rather than statistical: the type system separates observation from truth. ObservedExperimentData — the only data type the fit and analysis surfaces accept — has no ground-truth field. In simulation mode, planted truth travels in a separate CalibrationTruth object that only the explicitly truth-aware surfaces accept (posterior.evaluate_against_truth(tree, truth)). Analysis code structurally cannot peek at what it is supposed to discover.
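A minimal sketch of this pattern follows. The classes are hypothetical stand-ins, not pytyche's ObservedExperimentData or CalibrationTruth, and the field names are assumptions. The point is only that the analysis function receives a type with no truth attribute, so there is nothing for it to peek at.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Observed:
    """Stand-in for observed data: deliberately has no ground-truth field."""
    treated: tuple
    outcomes: tuple

@dataclass(frozen=True)
class PlantedTruth:
    """Stand-in for simulation truth, carried separately from the observations."""
    true_effects: tuple

def fit(data: Observed) -> float:
    """Analysis surface: difference in mean outcome, treated minus control."""
    treated = [y for y, z in zip(data.outcomes, data.treated) if z]
    control = [y for y, z in zip(data.outcomes, data.treated) if not z]
    return sum(treated) / len(treated) - sum(control) / len(control)

def evaluate(estimate: float, truth: PlantedTruth) -> float:
    """Truth-aware surface: absolute error against the planted average effect."""
    return abs(estimate - sum(truth.true_effects) / len(truth.true_effects))

data = Observed(treated=(1, 1, 0, 0), outcomes=(3.0, 5.0, 2.0, 2.0))
truth = PlantedTruth(true_effects=(2.0, 2.0, 2.0, 2.0))

estimate = fit(data)                 # sees only observations
error = evaluate(estimate, truth)    # truth enters only at evaluation time
```

Python does not enforce annotations at runtime, so in a sketch like this the guarantee comes from two things: the observed type simply lacks the field, and a static type checker rejects a call that passes the truth object to `fit`.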

Garden 2: operator honesty

What ships today is a set of design choices that remove the most tempting forking paths from the operator’s garden:

Inputs, not verdicts. The library surfaces decision-theoretic quantities — expected loss on each side, probability the comparison is better, the expected value of one more round — and the operator’s declared rule makes the call. There is no buried significance machinery to game; thresholds are explicit operator policy. See decision-theoretic inputs.
Sustained evidence, not first crossings. The default graduation rule (ExpectedLossRule) requires its conditions to hold across consecutive rounds (sustained_rounds), so a treatment cannot ship on one lucky round’s blip — the multi-round analogue of “no peeking.”
Controls are permanent. The experiment’s cell structure holds a control share in every round and every segment, however confident the posterior gets. Lift always has a live reference; measurement never silently degrades into “compared to what we remember.”
Claim levels are a typed vocabulary. ClaimLevel (exploratory / honest_estimate / confirmed) lives in pytyche.contracts and describes evidentiary strength, not the splitting mechanism. Today it is carried on the calibration pipeline’s records (SBC runs produce honest_estimate-level evaluations); wiring it through operator-facing analysis output is part of the roadmap below.
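The first two design choices can be illustrated with a short sketch. It is not pytyche's ExpectedLossRule or its API. The function names, the Beta(1, 1) prior, the conversion-rate setting, and the threshold are assumptions made for the example. It computes expected loss on each side and the probability that B is better from posterior draws, then applies a rule that requires the loss to stay under a declared threshold for consecutive rounds.

```python
import numpy as np

def decision_inputs(rng, conv_a, n_a, conv_b, n_b, draws=20000):
    """Monte Carlo decision inputs from Beta(1, 1)-prior posteriors on two rates."""
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=draws)
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=draws)
    return {
        "p_b_better": float(np.mean(b > a)),
        # Expected shortfall from each choice, averaged over the posterior.
        "loss_if_ship_b": float(np.mean(np.maximum(a - b, 0.0))),
        "loss_if_keep_a": float(np.mean(np.maximum(b - a, 0.0))),
    }

def sustained(losses, threshold, rounds):
    """True when the most recent `rounds` expected losses are all below `threshold`."""
    return len(losses) >= rounds and all(loss < threshold for loss in losses[-rounds:])

rng = np.random.default_rng(0)
inputs = decision_inputs(rng, conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(inputs)

# The operator's declared policy makes the call, not a buried significance test.
print(sustained([0.0100, 0.0004, 0.0003], threshold=0.001, rounds=2))  # True
print(sustained([0.0004, 0.0100, 0.0003], threshold=0.001, rounds=2))  # False
```

In the second call the loss dips under the threshold twice but not in consecutive rounds, so the rule does not fire. That is the sense in which a sustained rule guards against shipping on a single lucky round.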

The honesty paradox

There is a paradox in improving the estimator:

    Better CATE estimates (BCF + calibration)
    → more segments look "real" (supported by posteriors)
    → more credible stories to tell
    → more tempting forking paths for operators
    → MORE need for governance, not less

A weak estimator produces noisy, obviously uncertain results that operators naturally discount. A strong estimator produces precise, credible results that operators trust — including the ones that happen to be the most dramatic. The governance layer must scale with the estimator’s power. That is why Garden 2 is the focus of the next round of work.

Where this is going

Forward-looking design intent — not shipped behavior

This section is early, doc-driven design for the v0.3 operator-honesty work. It describes intended direction, subject to change; nothing below exists on the public surface today.

Claim-level gating of operator-facing output. Analysis results carry their ClaimLevel, and ship/stop language is reserved for claims at honest_estimate or above; exploratory findings inform the next round’s design rather than deployment decisions.
Hierarchical pooling as the multiplicity correction. Instead of treating each discovered segment’s effect as independent, model segment effects as draws from a common distribution:
```
    FLAT (current)                    HIERARCHICAL (planned)

    τ₁ ~ Posterior(data₁)             τ₁ ~ Normal(μ_τ, σ_τ)
    τ₂ ~ Posterior(data₂)             τ₂ ~ Normal(μ_τ, σ_τ)
    τ₃ ~ Posterior(data₃)             τ₃ ~ Normal(μ_τ, σ_τ)
                                              ↑       ↑
    Each independent →                  Hyperprior pools across
    no multiplicity correction          segments → shrinkage
```
The segment with a dramatically large effect gets pulled toward the group mean; a genuinely different effect overwhelms the prior and stays large, while a selection-amplified fluke shrinks. This is the Bayesian answer to multiple comparisons Gelman advocates in BDA3 ch. 5 — model the multiplicity instead of penalizing for it.
Informed priors across experiments. After experiment N, the population-level effect distribution becomes an empirical-Bayes prior for experiment N+1 — compound learning encoded as statistical structure rather than institutional memory.
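The shrinkage behavior described for hierarchical pooling can be sketched with a simple empirical-Bayes approximation. This is an illustration of the general idea, not the planned pytyche design. The function name is hypothetical, the hyperparameters are estimated by method of moments with an unweighted mean, and a full hierarchical model would instead place a hyperprior on them and integrate.

```python
import numpy as np

def partial_pool(estimates, std_errors):
    """Empirical-Bayes normal-normal shrinkage of segment effects toward a common mean."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(std_errors, dtype=float) ** 2
    mu = estimates.mean()                                    # pooled centre (unweighted)
    # Between-segment variance: observed spread minus average sampling noise, floored at 0.
    tau2 = max(estimates.var(ddof=1) - variances.mean(), 0.0)
    weight = tau2 / (tau2 + variances)                       # 0 = fully pooled, 1 = untouched
    return mu + weight * (estimates - mu)

segment_effects = [0.02, 0.03, 0.01, 0.025, 0.12]

# Same dramatic estimate in the last segment, measured noisily versus precisely.
pooled_noisy = partial_pool(segment_effects, [0.01, 0.01, 0.01, 0.01, 0.05])
pooled_precise = partial_pool(segment_effects, [0.01, 0.01, 0.01, 0.01, 0.01])

print(pooled_noisy.round(3))
print(pooled_precise.round(3))
```

With a standard error of 0.05 the 0.12 estimate is pulled down to roughly 0.07; with a standard error of 0.01 it stays near 0.116. The same raw number is treated as a likely fluke in one case and as a real difference in the other, purely on the strength of the evidence behind it.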

Key references

Reference	Contribution
Gelman and Loken (2013), “The Garden of Forking Paths”	The researcher-degrees-of-freedom problem motivating both gardens
Athey and Imbens (2016), “Recursive Partitioning for Heterogeneous Causal Effects”	Honest estimation via sample splitting for causal trees
Hahn, Murray, and Carvalho (2020), “Bayesian Regression Tree Models for Causal Inference”	BCF: Garden 1 via prior regularization instead of splitting
Gelman et al. (2013), BDA3, ch. 5	Hierarchical modeling as the Bayesian answer to multiple comparisons
Stucchio (2015), “Bayesian A/B Testing at VWO”	The expected-loss framework behind the decision-theoretic inputs
Talts et al. (2018), “Validating Bayesian Inference Algorithms with Simulation-Based Calibration”	Simulation-based calibration: the empirical honesty check pytyche runs at scale

Connection to other concepts

Overview — what pytyche is for; honest uncertainty as a design goal rather than a feature flag.
BCF calibration at scale — the empirical miscalibration story behind the SBC + correction layer.
Decision-theoretic inputs — the surfaced quantities that replace shipped verdicts.
Sequential targeting — controls retention and sustained graduation in the multi-round loop.