---
title: Working with the posterior
review-state: drafting
last-human-review: "2026-06-11"
depends-on:
  - src/pytyche/bcf/hurdle/model.py
  - src/pytyche/bcf/config.py
  - src/pytyche/contracts.py
  - src/pytyche/compare/variants.py
  - src/pytyche/generators/scenarios.py
  - src/pytyche/generators/core.py
owner: tradcliffe
quadrant: tutorial
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# Working with the posterior

A topic-organized walk through the posterior's analysis surface — fitting, inspecting, allocating, deciding, verifying. The narrative "day in the life" of a data scientist running a real adaptive experiment lives in [first adaptive experiment](first-adaptive-experiment.md); this doc is the reference companion you reach for when you know what capability you need and want to see it isolated.

Each section covers one capability, shows the canonical method call, and explains how the output composes with the other methods. The {term}`joint posterior` is the root primitive everything derives from — see [result objects](../concepts/result-objects.md) for the type hierarchy.

## Setup

```{code-cell} ipython3
import os
os.environ.setdefault("JAX_PLATFORMS", "cpu")  # delete this line to run on your GPU

import pytyche as pt
```

To confirm your JAX platform, GPU availability, and library versions, run `pt.check_setup()`.

The runnable examples use pytyche's `clustered_realistic` generator so the tutorial is self-contained:

```{code-cell} ipython3
from pytyche.generators.core import generate_v2_core

config, _meta = pt.generators.clustered_realistic(
    "revenue_per_visitor", 1.0, K=3, n_visitors=2_000, seed=42
)
bundle = generate_v2_core(config)
observed, truth = bundle
```

`pt.generators.<name>(...)` is the attribute-access sugar for `pt.TEMPLATES["<name>"](...)` — both resolve to the same registered template function. The first two positional arguments are `metric_id` (e.g.
`"revenue_per_visitor"`) and `effect_scale` (the planted signal amplitude). `K` is the treatment count (control + K−1 active treatments). The template returns a `(V2GeneratorConfig, dict)` tuple; `generate_v2_core(config)` returns a `CalibrationBundle` — a `NamedTuple` that unpacks to `(observed, truth)`.

`truth` carries the ground truth the generator planted; the [truth comparison](#truth-comparison) section uses it. Every other section reads identically to what you'd write against real data.

## §1 — Fitting and dispatch

`pt.fit(observed)` is the canonical entry point. It inspects the data shape and dispatches to the right BCF flavor:

- continuous outcome → `fit_continuous_bcf`
- binary outcome → `fit_binary_bcf`
- zero-inflated outcome (hurdle) → `fit_hurdle_bcf`

For the dataset above (zero-inflated revenue, three treatments) it dispatches to the joint multi-arm hurdle BCF.

:::{admonition} Runtime
:class: note
Expect this fit to take a few minutes on CPU, or about a minute on a GPU. The same code scales to hundreds of thousands of visitors on GPU — see [your first hurdle BCF fit](first-hurdle-bcf-fit.md) for sizing guidance.
:::

```{code-cell} ipython3
posterior = pt.fit(
    observed,
    num_chains=2,
    num_mcmc=200,
    num_burnin=200,
    seed=42,
)
```

If you want explicit control over the fit function (skipping the auto-dispatch), call `pt.fit_hurdle_bcf(observed, ...)` directly. The returned posterior is identical bit-for-bit when seeds and kwargs match — `pt.fit` is a thin dispatcher, not a separate fit path.

The default `pooling` kwarg for hurdle outcomes is `"joint"` — the canonical shared-tree fit that borrows strength across the conversion and severity channels. Pass `pooling="independent"` (binary arm only) for the two-stage baseline.

The posterior stashes `observed` on itself (`posterior.observed`), so downstream methods don't need it passed again. The default copy semantics is `view` — `posterior.observed is observed` — which is what you want for in-process flows.
Pass `observed_copy="deep"` on `pt.fit` if you need the posterior to retain an independent copy (serialization across processes, mutation guarantees).

## §2 — Inspection

`posterior.analyze()` produces the canonical analysis output — global per-treatment contrasts plus a fitted policy-tree segmentation. This is the single call that answers "what does the data say about this experiment?"

The fit sees every feature column in the observed frame. For this dataset that is five e-commerce features (`session_recency`, `browse_depth`, `device_type`, `is_returning`, `channel`) plus five noise columns (`z0`–`z4`) the generator added. Which columns carry signal is truth-side knowledge the analysis path never receives — the policy tree has to discover it from the CATEs alone, with the no-signal columns competing for every split:

```{code-cell} ipython3
analysis = posterior.analyze()
analysis
```

If you have a segment hypothesis of your own (say, "mobile visitors respond differently"), you don't need `analyze` for it — declare the segment with the rule algebra and ask the per-segment primitives directly: `posterior.recommendation_summary(treatment, segment=...)` gives the declared segment's decision-theoretic snapshot, and `posterior.thompson_allocation(segments=[...])` gives its allocation. A higher-level convenience for declaring segment hypotheses up front is planned but not yet available; the compositional path above is the supported route today.

### Global contrasts

```{code-cell} ipython3
for comparison in analysis.comparisons:
    print(comparison)
```

`P(lift > 0)` is the posterior probability that the treatment beats control at the population level. The lift point estimate and credible interval are in outcome units (revenue per visitor, for this dataset). The 80% credible interval follows the convention set in `contracts.py`: it is narrower than a 95% interval, which biases decisions toward action. See [statistical honesty](../concepts/statistical-honesty.md) for the reasoning.
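
To make the two summary numbers concrete, here is a minimal sketch that computes them from posterior draws of a single contrast. It does not call pytyche: the draws are synthetic, `summarize_lift` is a hypothetical helper, and the equal-tailed interval between the 10th and 90th percentiles is an assumption about how an 80% interval can be formed, not a statement of the library's implementation.

```python
import random
import statistics


def summarize_lift(draws):
    """Summarize posterior draws of a lift (treatment minus control)."""
    # Share of draws above zero estimates P(lift > 0).
    p_positive = sum(d > 0 for d in draws) / len(draws)
    # quantiles(n=10) returns the nine decile cut points; the first and
    # last are the 10th and 90th percentiles (an equal-tailed 80% interval).
    deciles = statistics.quantiles(draws, n=10)
    return {
        "mean": statistics.fmean(draws),
        "p_lift_positive": p_positive,
        "ci80": (deciles[0], deciles[-1]),
    }


rng = random.Random(0)
# Synthetic stand-in for posterior draws of one global contrast.
draws = [rng.gauss(0.30, 0.25) for _ in range(4000)]
print(summarize_lift(draws))
```

The same draws feed both numbers, so a contrast whose 80% interval sits entirely above zero has `P(lift > 0)` of at least 0.90 under this equal-tailed construction.
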

### CATE per visitor

`analysis.cate_per_visitor` holds the posterior-mean CATE per visitor. At K=3 its shape is `(n, K-1)` — one column per treatment contrast (treatment 1 vs control, treatment 2 vs control):

```{code-cell} ipython3
import numpy as np

print(f"cate_per_visitor shape: {analysis.cate_per_visitor.shape}")
print(f" (n_visitors, K-1) = ({analysis.cate_per_visitor.shape[0]}, "
      f"{analysis.cate_per_visitor.shape[1]})")
print(f" mean per contrast: {analysis.cate_per_visitor.mean(axis=0)}")
```

### Segments

```{code-cell} ipython3
for segment in analysis.segments:
    print(segment)
```

Each segment carries:

- `gate_estimate` — posterior mean of the segment-level group average treatment effect (GATE)
- `gate_ci` — 80% credible interval on the GATE
- `population_share` — fraction of the population in this segment
- `stability_score` — bootstrap replicability, see below
- `arm_best_probabilities` — per-arm posterior probability that the arm is best in this segment, see below

The segmentation came from a {term}`policy` tree fitted on per-visitor CATEs over every feature column in the observed frame.

### Capability detection

```{code-cell} ipython3
posterior.has_decomposition()
```

`has_decomposition()` is `True` for hurdle posteriors — the fit produces both a conversion channel and a severity channel, so a decomposition exists.

```{code-cell} ipython3
posterior.has_credible_segments(threshold=0.90)
```

`has_credible_segments(threshold)` checks whether any discovered segment has `stability_score >= threshold`; the default threshold is `0.80`.

### Stability score

`stability_score` is a bootstrap replicability metric, not a posterior credible-interval width:

> Pytyche resamples the per-visitor CATEs with replacement, refits the
> policy tree on each resample (the posterior is NOT refit — tree
> refits take seconds), and reports the fraction of refits that recover
> a matching segment boundary (Jaccard overlap ≥ 0.5 with the original
> segment's members).

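
The procedure quoted above can be sketched in plain Python. This is an illustration under stated assumptions, not pytyche's implementation: `fit_stump` is a hypothetical one-split stand-in for the policy tree, the data are synthetic, and scoring each refit rule on the original visitors (so that member sets are comparable) is one plausible reading of "matching segment boundary".

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two sets of visitor ids."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fit_stump(xs, cates, grid):
    """Toy stand-in for a policy tree: one split on one feature.

    Returns the grid threshold that maximizes the gap in mean CATE
    between the two sides of the split.
    """
    best_gap, best_threshold = -1.0, grid[0]
    for t in grid:
        left = [c for x, c in zip(xs, cates) if x < t]
        right = [c for x, c in zip(xs, cates) if x >= t]
        if not left or not right:
            continue  # skip degenerate splits
        gap = abs(sum(right) / len(right) - sum(left) / len(left))
        if gap > best_gap:
            best_gap, best_threshold = gap, t
    return best_threshold


def stability_score(xs, cates, n_bootstrap=50, seed=7):
    """Fraction of bootstrap refits whose segment matches the original one."""
    if n_bootstrap == 0:
        return float("nan")  # no bootstrap, no score
    grid = [i / 10 for i in range(1, 10)]
    n = len(xs)
    t0 = fit_stump(xs, cates, grid)
    original = {i for i in range(n) if xs[i] >= t0}
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        t_b = fit_stump([xs[i] for i in idx], [cates[i] for i in idx], grid)
        # Apply the refit rule to the original visitors before comparing.
        members = {i for i in range(n) if xs[i] >= t_b}
        hits += jaccard(original, members) >= 0.5
    return hits / n_bootstrap


data_rng = random.Random(0)
xs = [data_rng.random() for _ in range(300)]
# Synthetic CATEs: visitors with x >= 0.6 respond, plus noise.
cates = [(1.0 if x >= 0.6 else 0.0) + data_rng.gauss(0, 0.3) for x in xs]
print(f"stability_score = {stability_score(xs, cates):.2f}")
```

Only the cheap tree step is repeated inside the loop, which is why the bootstrap adds seconds rather than another posterior fit.
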

A score of `1.0` means every bootstrap finds this segment; `0.0` means it's a one-off the data happened to support. The default is `n_bootstrap=50`; pass `n_bootstrap=...` to `posterior.fit_policy_tree(...)` to override it. Setting `n_bootstrap=0` skips the bootstrap entirely — segments then carry `stability_score = NaN` and a `UserWarning` is emitted. The `bootstrap_seed` kwarg makes bootstrap results deterministic across calls:

```{code-cell} ipython3
tree = posterior.fit_policy_tree(
    max_depth=3,
    min_segment_share=0.10,
    n_bootstrap=50,
    bootstrap_seed=7,
)
for leaf_id, score in tree.stability_scores.items():
    print(f" leaf {leaf_id}: stability_score={score:.2f}")
```

`tree.stability_scores` is a `dict[int, float]` keyed by leaf id. Each segment in `tree.segments` has a matching `id` field for lookups: `tree.stability_scores[segment.id] == segment.stability_score`.

### Per-segment arm best-probabilities

```{code-cell} ipython3
for segment in analysis.segments:
    print(f"{segment.rule.description}: {segment.arm_best_probabilities}")
```

`arm_best_probabilities` is a dict keyed by variant name — control included. Values are the posterior probability that the arm is best in this segment under the shared best-arm rule (at K=3: the arm with the largest positive contrast wins; control wins when every contrast is non-positive). Values sum to 1.0.

Use this to:

- spot leaves where the engine is confident (one arm dominates) vs ambivalent (two or more arms near-tied)
- find drop-treatment candidates (arms with near-zero probability of being best in every segment)
- decide per-segment ship vs hold (a segment where the leader has `P > 0.90` is shippable; one where it's `0.45` against a `0.40` runner-up is still exploring)

See [decision-theoretic inputs](../concepts/decision-theoretic-inputs.md) for how this composes with the global expected-loss picture.
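
The best-arm rule described in this section can be written out as a short sketch. It assumes you hold per-draw segment-level contrasts against control; the function name, the arm labels, and the four hand-written draws are hypothetical and only illustrate the rule, not how pytyche stores or computes these values.

```python
def arm_best_probabilities(contrast_draws, arm_names, control="control"):
    """Share of posterior draws in which each arm is best.

    contrast_draws: one list per draw, holding that draw's contrast
    versus control for each arm in arm_names (same order).
    """
    wins = {control: 0, **{name: 0 for name in arm_names}}
    for draw in contrast_draws:
        best_idx = max(range(len(draw)), key=draw.__getitem__)
        # The largest contrast wins only if it is positive; otherwise control wins.
        winner = arm_names[best_idx] if draw[best_idx] > 0 else control
        wins[winner] += 1
    n_draws = len(contrast_draws)
    return {arm: count / n_draws for arm, count in wins.items()}


# Four illustrative draws for a K=3 experiment (two active treatments).
draws = [[0.5, 0.2], [-0.1, -0.3], [0.1, 0.4], [0.0, -1.0]]
probs = arm_best_probabilities(draws, ["treatment_1", "treatment_2"])
print(probs)
```

Because every draw names exactly one winner, the values always sum to 1.0, and control collects the draws in which no contrast is strictly positive.
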

## §3 — Allocation

`posterior.thompson_allocation(segments, epsilon=0.02)` returns the per-segment treatment mix as a `dict[int, dict[str, float]]` keyed by leaf id. Each value is a per-treatment weight dict that sums to 1.0.

```{code-cell} ipython3
allocation = posterior.thompson_allocation(segments=tree.segments)
for segment_id, weights in allocation.items():
    print(f" segment {segment_id}: {weights}")
```

The `epsilon` kwarg is a safety-net floor inside the Thompson computation (each arm keeps at least `ε/K` of the leaf's allocation, so no arm is starved to zero). It is NOT the dial for how much traffic stays on control — control retention is a property of the experiment's cell structure, set with `min_control_weight` and `min_explore_weight` on `pt.sequential_experiment(...)`. See [sequential targeting](../concepts/sequential-targeting.md) for how the control, exploration, and optimized cells compose.

The allocation basis is `segment_mean` — every visitor in a segment gets the same allocation. Per-visitor allocation is not exposed at the public surface.

### Function-form alternative

`pt.fit_policy_tree` and `pt.thompson_allocation` are thin function-form wrappers that delegate to the method forms. The two produce identical results:

```{code-cell} ipython3
# Both forms return the same result:
tree_via_fn = pt.fit_policy_tree(posterior, max_depth=3)
```

## §4 — Decision support

`posterior.recommendation_summary(treatment, segment)` returns the decision-theoretic snapshot — the inputs you compose into a graduation rule. `segment=None` (the default) computes the global snapshot over all visitors for that treatment's contrast:

```{code-cell} ipython3
treatment_name = posterior.observed.treatment_names[0]
posterior.recommendation_summary(treatment_name)
```

`expected_value_of_one_more_round` is the information-theoretic value of running one more round at the same per-round sample size, in outcome units (expected loss reduction per visitor).
Near-zero means the experiment has effectively converged and additional data is unlikely to change the decision.

The standard composition:

- **Ship** when `expected_loss_comparison` is below your tolerance for N consecutive rounds.
- **Continue** when `expected_value_of_one_more_round` exceeds the per-visitor cost of another round of latency.
- **Stop** when `expected_value_of_one_more_round` is at or below the round cost AND no contrast has cleared the ship threshold — the experiment has converged on an ambiguous answer.

`analysis.recommendation` is the same global snapshot for the best challenger (the treatment with the largest global posterior-mean contrast).

For a per-segment snapshot, pass a `DiscoveredSegment`:

```{code-cell} ipython3
posterior.recommendation_summary(treatment_name, segment=analysis.segments[0])
```

The library doesn't have an opinion on your thresholds. See [decision-theoretic inputs](../concepts/decision-theoretic-inputs.md) for the formulas, when to trust them, and an example custom rule.

## §5 — Visualization

```{code-cell} ipython3
import pytyche.viz as ptviz
```

`pt.viz` is lazily imported — matplotlib is only loaded when you touch `pt.viz.*`. Each primitive accepts an optional `ax=` parameter for subplot composition into an existing figure; when omitted it creates a fresh figure and returns the `Axes`.

Forest plot of segment credible intervals:

```{code-cell} ipython3
if analysis.segments:
    ax = ptviz.plot_segment_intervals(analysis.segments)
```

Policy-tree visualization:

```{code-cell} ipython3
ax_tree = ptviz.plot_policy_tree(tree)
```

Pass `ax=existing_ax` to compose either plot into a figure you manage; the function renders into the provided axes and returns it.

The `pt.viz.experiment_evolution_gif(history, ...)` animated GIF primitive (which renders multi-round experiment history) requires the sequential surface's `Experiment` type and will arrive with that work.

## §6 — Hurdle decomposition

For hurdle outcomes — zero-inflated metrics like revenue-per-visitor — the per-visitor effect decomposes into a change in conversion probability (more buyers) and a change in basket size given conversion (larger orders).

```{code-cell} ipython3
if posterior.has_decomposition():
    for comparison in analysis.comparisons:
        print(f"{comparison.treatment} — {comparison.decomposition!r}")
```

`posterior.has_decomposition()` returns `True` for joint hurdle posteriors and `False` for continuous or binary posteriors (those have a single channel; nothing to decompose). Continuous and binary posteriors expose every other method on this page — only the hurdle decomposition is hurdle-specific.

(truth-comparison)=
## §7 — Truth comparison

`posterior.evaluate_against_truth(tree, truth)` is the sim-only diagnostic. It exists when (and only when) the observed data came with a planted ground truth — i.e. when you generated the data with `pt.generators.*` and have access to `bundle.truth`. Production data never has this.

```{code-cell} ipython3
posterior.evaluate_against_truth(tree=tree, truth=truth)
```

- **CATE RMSE** — per-visitor error in the recovered conditional treatment effect vs the planted one (pooled across all `n × (K-1)` contrast entries at K=3).
- **Policy accuracy** — fraction of visitors whose recommended treatment matches the truth-optimal one for their features.
- **Oracle gap** — per-visitor regret of the recommended policy vs an oracle policy that always picks the best treatment.
- **rpv_policy / rpv_uniform / rpv_oracle** — the expected revenue per visitor under the predicted policy, a uniform-allocation baseline, and the oracle policy, respectively. These three together let you quantify how much value the policy tree captures relative to the theoretical maximum.

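
The first three metrics in the list above can be illustrated with a small hand-checkable sketch. It is not the library's code: `truth_metrics` and `best_arm` are hypothetical helpers, the inputs are tiny hand-written arrays of contrasts against control, and the recommended policy here is a per-visitor argmax over estimated contrasts rather than the policy tree that `evaluate_against_truth` receives.

```python
import math


def best_arm(contrasts):
    """Index of the best arm given contrasts vs control (0 means control)."""
    values = [0.0] + list(contrasts)  # control is the zero reference
    # max() returns the first maximum, so control wins ties at zero.
    return max(range(len(values)), key=values.__getitem__)


def truth_metrics(cate_hat, cate_true):
    """Pooled CATE RMSE, policy accuracy, and mean per-visitor oracle gap."""
    n = len(cate_true)
    sq_err = [
        (h - t) ** 2
        for hat_row, true_row in zip(cate_hat, cate_true)
        for h, t in zip(hat_row, true_row)
    ]
    hits, regret = 0, 0.0
    for hat_row, true_row in zip(cate_hat, cate_true):
        true_values = [0.0] + list(true_row)
        chosen, oracle = best_arm(hat_row), best_arm(true_row)
        hits += chosen == oracle
        # Regret is measured in true outcome units, never estimated ones.
        regret += true_values[oracle] - true_values[chosen]
    return {
        "cate_rmse": math.sqrt(sum(sq_err) / len(sq_err)),
        "policy_accuracy": hits / n,
        "oracle_gap": regret / n,
    }


# Three visitors, K=3: each row is (treatment 1 vs control, treatment 2 vs control).
cate_true = [[1.0, 0.5], [-1.0, -0.5], [0.2, 0.8]]
cate_hat = [[0.8, 0.6], [0.1, -0.2], [0.3, 0.5]]
metrics = truth_metrics(cate_hat, cate_true)
print(metrics)
```

The second visitor shows why the metrics are reported together: a small positive estimate (0.1) against a truly harmful treatment (-1.0) costs one unit of regret even though most contrast entries are close to the truth.
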

This is the same comparison the SBC calibration sweep does at scale to fit the R(p) calibration corrections (see [BCF posterior calibration at scale](../concepts/bcf-calibration-at-scale.md)). Running it on individual analyses is useful when prototyping a new generator template, validating that the calibration regime matches your domain, or onboarding a new analyst to the library.

## Cross-references

- [First adaptive experiment](first-adaptive-experiment.md) — the narrative tutorial that puts these capabilities in a single end-to-end flow.
- [Result objects](../concepts/result-objects.md) — the type hierarchy and observed-stashing semantics referenced throughout.
- [Decision-theoretic inputs](../concepts/decision-theoretic-inputs.md) — the formulas behind §4.
- [Sequential targeting](../concepts/sequential-targeting.md) — why §3's allocation is structured the way it is.
- [BCF posterior calibration at scale](../concepts/bcf-calibration-at-scale.md) — the calibration workflow (`apply_calibration` and the R(p) corrections) and why it's needed. A dedicated hands-on tutorial is planned.
- [Glossary](../concepts/glossary.md) — term definitions.