BCF Posterior Calibration at Scale¶

Why BCF credible intervals are too narrow at large sample sizes, whether it matters for sequential targeting, and what to do about it.

Executive Summary¶

BCF/BART credible intervals systematically undercover at large n. Segment-mean 90% CI coverage degrades from 0.91 at 30k to 0.60 at 1.2M+, monotonically. Point estimates are fine and improving — decision accuracy (CATE sign) rises with n from 0.69 to 0.91, and overconfident-wrong allocation decisions stay rare (2–4%, flat across scales). The problem is specifically that the intervals are too narrow.

The corrected diagnosis: this is a credible-interval width problem — posterior under-dispersion — not a point-estimate bias problem. The intervals are not concentrating around the wrong value; they are concentrating too tightly around roughly the right value. The decisive evidence is that an oracle bias correction — subtracting the known within-leaf approximation error using ground truth — did not restore coverage. If the cause were an O(1) partition-resolution bias (the original hypothesis, below), removing that bias exactly would fix coverage. It didn’t. Alongside this, prior-relaxation and tree-count sweeps moved coverage only modestly (+0.06 to +0.14, plateauing well short of nominal at large n). The missing ingredient is variance, not a shift in center.

Where the missing variance comes from: poor MCMC mixing over tree topology. A correctly-sampled BART posterior’s width includes a between-tree-structure (partition) component — the uncertainty about which partition of covariate space is right. Standard grow/prune MCMC moves over tree structure very slowly: bartz’s own author (Petrillo 2024) states the sampler “moves the tree structures slowly, and in practice in most problems it will not reach convergence.” The chains converge on the noise variance (σ² split-half ratios hit 1.000) but under-explore the partition posterior, so the partition-uncertainty component is largely missing from the intervals — and they come out too narrow. The standard split-half diagnostics measure σ² convergence, not tree-structure convergence, which is why this hid for so long. A direct corroboration: a sampling-level fix (Pratola plain-perturb) recovered cov90 from 0.640 to 0.884 on an n=5000 single-channel Friedman fit. A model-class limitation cannot be repaired by better sampling — so the cause is the sampling, not the model class. (This result is at n=5000; the large-n confirmation is preliminary — see below.)

The original hypothesis — O(1) Donsker bias — is retained below as superseded context. It predicted that more tau trees (via a T³ scaling law) would be the primary fix. That prediction failed empirically: more trees helped partially and plateaued, the T³ formula overestimated coverage on realistic DGPs, and the oracle bias test ruled out bias as the mechanism. The derivation is kept for the record and because the tree-count guidance is still a useful lower bound — just not a complete fix.

Two responses, addressing the same under-dispersion at different points in the pipeline:

Better topology mixing (at the source). Pratola-style rotation/perturb MCMC moves explore tree structure more effectively, restoring the missing partition variance where it belongs — inside the sampler. Preliminary: demonstrated at n=5000; the large-n (250k–750k) perturb sweep is the deciding experiment and is not yet complete.
Post-hoc recalibration (after the fit). Measure the coverage gap on known-truth SBC sweeps and inflate interval width to match — a layered R(p) isotonic correction below the recorded ceiling plus a parametric scale-family inflation past it (see the layered calibration architecture). This is a principled response to an intractable sampling problem, not a patch over a bug: if the chains genuinely cannot traverse the partition posterior at production n in feasible wall-clock, then measuring the shortfall against ground truth and correcting for it is the honest engineering move.

The speed/calibration tradeoff is real and explicit. Both responses cost throughput. Better mixing (perturb/rotation) makes each MCMC sweep more expensive. Recalibration is cheap to apply but requires the upstream SBC sweep to fit the correction, and the corrected intervals are wider — you trade decisiveness for honesty. Calibrated mode is not free: it costs fit throughput (the sweep) and downstream decisiveness (wider intervals mean slower convergence, more exploration). The library ships uncalibrated by default with an explicit UncalibratedWarning, and treats calibration as an opt-in you take on knowingly.

Practical impact is moderate. For the sequential targeting use case, miscalibrated CIs mainly affect stopping rules (premature convergence) and Thompson exploration (under-exploration of uncertain segments). Allocation direction is robust. Estimated welfare loss: 1–3% of oracle lift, concentrated in the 50k–250k window where CIs are degrading but point estimates aren’t yet reliable.

The Phenomenon¶

Segment-mean 90% CI coverage degrades monotonically with cumulative sample size. Data from 9,087 segments across 4 DGP sweeps (~700 BCF fits), March 2026:

Cumulative n	Coverage (90% nom.)	Width/RMSE	Decision accuracy
0-30k	0.91	0.88	0.69
30-120k	0.83	0.65	0.78
120-250k	0.78	0.45	0.83
250-500k	0.73	0.35	0.88
1.2M+	0.60	0.31	0.91

All sweep data generated with estimator="gpu" (alpha_tau=0.75, beta_tau=3.0, 50 tau trees). Config confirmed in run_dgp_sweep.py:424 and run_targeting_sweep.py:159.

What this table shows, read through the corrected diagnosis: CI width shrinks faster than it should. Width/RMSE drops from 0.88 to 0.31 while decision accuracy improves from 0.69 to 0.91 — the point estimate is getting better while the interval around it collapses. The interval is under-dispersed: it omits a variance component (partition uncertainty) that an honestly-sampled posterior would carry. It is not that the posterior centers on the wrong value (the oracle bias test rules that out); it is that it is too sure of the value it centers on.

Corrected diagnosis: under-dispersion from tree-structure mixing¶

The intervals are too narrow because the sampler does not explore the posterior over tree structure — the partition of covariate space — fast enough to carry its full uncertainty into the credible interval.

The chain of evidence that points here, not at bias:

The oracle bias test. Retain per-tree leaf assignments, compute the within-leaf approximation error against known ground truth, subtract it from the posterior draws, and re-measure coverage. If an O(1) approximation bias were the cause, this exact correction would restore nominal coverage. It did not. Bias is not the mechanism. (This is the decisive negative result; if you are re-running it, treat the magnitude of residual undercoverage as the quantity to report.)
Tree-count sweeps plateau. The bias theory’s T³ scaling law predicted that going 50→200 tau trees would extend nominal coverage to ~2M. Empirically (2026-03-27 sweep, below) it lifted 100k from 0.73 to 0.89 but only moved 500k from 0.61 to 0.70 — partial and plateauing, not the predicted near-complete fix.
Prior/depth relaxation is small. Loosening the tau prior (α=0.75/β=3.0 → 0.95/β=2.0) bought +0.06 to +0.14 — real but far short.
A sampling fix works where a model fix shouldn’t. Pratola plain-perturb recovered cov90 0.640 → 0.884 at n=5000. Better sampling cannot repair a model-class limitation, so the limitation is in the sampling.
The diagnostics that said “converged” measured the wrong thing. σ² split-half ratios hit 1.000 — the noise variance is mixed. But tree-structure mixing is a separate question those diagnostics never touched, and Petrillo (2024, the bartz author) is explicit that grow/prune “moves the tree structures slowly, and in practice in most problems it will not reach convergence.”

The mechanism. An honestly-sampled BART/BCF posterior carries two sources of width: within-structure leaf-value uncertainty (which the chains do capture — it is the part that shrinks correctly at 1/√n) and between-structure partition uncertainty (which they largely do not, because grow/prune cannot traverse the space of partitions in feasible wall-clock at large n). The credible interval is missing the second component, so it is narrower than the true posterior would be. This is posterior under-dispersion, and it is a property of the sampler’s reach, not of the function class. It worsens with n precisely because the within-structure component shrinks (1/√n) while the missing between-structure component does not — so its relative omission grows.

The undercoverage is also non-uniform: visitors near the CATE sign boundary (small |τ|) are worst, high-CATE visitors better. That pattern is consistent with missing partition variance (the boundary is where partition choice matters most) and not especially consistent with a uniform additive bias.

The calibration response: recalibration as a principled correction¶

If the sampler genuinely cannot traverse the partition posterior at production n within a feasible compute budget, then the missing variance cannot always be recovered at the source — and pretending the raw intervals are honest would be the actual error. The principled response is to measure the shortfall against ground truth and correct for it:

Simulation sweeps with known truth (the SBC-style calibration sweeps) record, per nominal coverage level, the empirically achieved coverage. (Note: “SBC” here means simulation-based coverage evaluation + correction, not the classical rank-statistic Simulation-Based Calibration of Talts et al. — see the SBC (in pytyche) glossary entry.)
A layered correction maps requested coverage onto the interval width that actually achieves it: a non-parametric isotonic R(p) below the empirically-recorded ceiling, and a parametric scale-family width inflation past that ceiling (the regime the sweeps can’t reach directly). See the calibration architecture and the pytyche.calibrate module.

This is not a patch over a bug. The under-dispersion is the honest output of a sampler doing the best it can on an intractable mixing problem; the recalibration is a measured, reproducible correction with a known provenance, applied with eyes open. It is the same epistemic stance as the rest of the library: surface the uncertainty honestly rather than report a number that looks more confident than the evidence supports.

Two responses, one diagnosis. Better topology mixing (perturb/rotation) attacks the under-dispersion at the source; recalibration corrects what mixing cannot reach. They are complementary, not alternatives — and both cost throughput, which the next section makes explicit.

The speed/calibration tradeoff¶

Neither response is free, and the costs land in different places:

Response	When you pay	What it costs
Perturb/rotation mixing	At fit time	More expensive MCMC moves per sweep — lower fit throughput (the perturb move’s cost is the subject of an active performance refactor; large-n viability is preliminary).
Post-hoc recalibration	Once upstream + a little at apply time	An SBC sweep to fit the correction (GPU-hours), then a cheap per-fit apply. The corrected intervals are wider, so downstream the loop converges more slowly and explores more.

The library’s default is uncalibrated, with an explicit UncalibratedWarning on every fit, because the cheapest honest thing is to tell you the intervals are overconfident rather than silently widen them. Calibrated mode is an opt-in: you take on the sweep cost and the wider-interval cost in exchange for intervals you can put a coverage number on. State this tradeoff to anyone choosing a mode — “calibrated” is slower and less decisive by construction, and that is the point.

Superseded hypothesis: Donsker bias in piecewise-constant ensembles¶

Retained as context — not the current diagnosis

This section presents the originally-posited O(1) partition-resolution bias root cause (the Donsker-failure reading). It is superseded: the oracle bias test (subtracting known within-leaf error did not restore coverage) and the plateauing tree-count sweeps moved the diagnosis to posterior under-dispersion from tree-structure mixing failure — see the Executive Summary and the corrected diagnosis. The math below is kept because the single-leaf posterior derivation is correct and the tree-count formula remains a useful lower bound; only the framing of “bias as the root cause” is retired.

The single-leaf posterior¶

The conversion channel leaf posterior (bcf/lml.py) is:

Model:    r_i | θ ~ N(h_i · θ, 1)         h_i = ±0.5 (basis coding)
Prior:    θ ~ N(0, τ_prior)                τ_prior = 1/T_tau = 1/50 = 0.02

Posterior (Normal-Normal conjugacy):

precision = Σ h_i² + 1/τ_prior = n_leaf · 0.25 + 50
posterior_mean = Σ(h_i · r_i) / precision
posterior_var = 1 / precision

As n_leaf grows, the prior contribution (50) becomes negligible. The posterior concentrates on the data mean at rate 1/n_leaf.

The ensemble CATE¶

The CATE is τ̂(x) = Σ_t θ_t(x) — sum of T=50 tau tree leaf values. Since trees are conditionally independent given structure:

Var[τ̂(x)] = Σ_t 1/(n_{leaf,t} · 0.25 + 50)

For stumps with ~2 leaves per tree, n_leaf ≈ n/2:

n	Posterior SD per visitor
30k	0.113
250k	0.040
500k	0.028
1.2M	0.018

Posterior SD shrinks as O(1/sqrt(n)). This is correct Bayesian behavior — the problem is what it’s converging to.

The bias: what the ensemble converges to¶

At large n, the BART ensemble converges to the best L² approximation of τ(x) within the additive piecewise-constant function class:

F_T = {Σ_{t=1}^T a_t · 1(x_{j_t} > s_t) + c : a_t ∈ R, j_t ∈ {1,...,d}, s_t ∈ R}

For a Lipschitz-L function τ: [0,1]^d → R with additive structure τ(x) = Σ_j f_j(x_j), the best L² approximation error with T stumps optimally allocated across d dimensions is:

Bias² ≈ [Σ_j L_j^{2/3}]³ / (12T²)

For d=5, L_j = L: Bias² ≈ 10L²/T².

This bias is O(1) — it depends on the function’s smoothness and the number of trees, not on n. No amount of additional data reduces it because the partition structure doesn’t refine with n (the tree depth prior is fixed).

The critical n where coverage breaks¶

Coverage requires that the z-score of the bias relative to posterior SD stays within the CI’s critical value:

|Bias| / SD[τ̂(x)] ≤ z_{0.95} = 1.645    (for 90% CI)

Substituting:

sqrt(10L²/T²) / sqrt(8T/n) ≤ 1.645
→ n ≤ 1.645² · 8T³ / (10L²)
→ n_critical ≈ 2.16 · T³ / L²

This is the key result: n_critical scales as T³.

Config	n_critical (L=1)	n_critical (L=3)
T=50 (current)	270k	30k
T=100	2.2M	240k
T=200	17M	1.9M

The L=3 column (moderate heterogeneity) aligns well with the empirical data — coverage starts degrading around 30k. For the simpler DGP templates (lower L), coverage holds longer.

The Donsker condition (why this is fundamental)¶

Breunig, Liu & Yu (2025, “RoBART”, arXiv:2509.24634) prove that BART’s posterior satisfies a Bernstein-von Mises theorem but with an explicit bias term:

√n(χ̂ - χ₀) →_d N(b₀, σ²_eff)

where b₀ = P_n[(γ₀ - 1)(m₀ - m_η)] is the non-vanishing bias from piecewise-constant approximation. The bias b₀ arises because the influence function class for piecewise-constant estimators fails the Donsker condition when covariates are multivariate.

The Donsker requirement demands Holder smoothness α > q₀/2 where q₀ is the number of active covariates. With d=5 covariates, we need α > 2.5, but piecewise-constant functions have α ≤ 1. The bias is structural — not a hyperparameter choice.

Theoretical backdrop: tree count and approximation quality¶

Rockova & van der Pas (2020, Annals of Statistics) prove that BART regression ensembles achieve minimax-optimal posterior contraction rates when the total number of trees T ~ √n. This result applies to generic BART regression, not specifically to BCF’s tau forest — the tau forest is deliberately constrained via the regularization prior (alpha_tau, beta_tau) to encode the assumption that treatment effect heterogeneity is simpler than the prognostic surface.

Nonetheless, the T ~ √n scaling illuminates the mechanism: with fixed T, the piecewise-constant approximation quality plateaus while the posterior keeps concentrating. The practical implication for the tau forest is that its additive resolution (how many distinct step functions compose the CATE surface) is the binding constraint. The n_critical ~ T³ result derived above applies specifically to the tau tree count.

Why Existing Fixes Don’t Work¶

Heteroscedastic model: +2.2% coverage, doesn’t fix the gap¶

Per-leaf gamma (freeze_gamma=False) adjusts the per-observation precision v_i = τ₀ · τ̂_hat_i. This enters the severity leaf posterior through precision-weighted sufficient statistics (the hurdle sufficient-statistics path in bcf/lml.py):

sum_vr = Σ(v_i · h_i · r_i)    # precision-weighted
sum_vh2 = Σ(v_i · h_i²)
prec_beta = sum_vh2 + kappa     # posterior precision

Gamma fixes the likelihood weighting (which observations matter more) but not the prior on leaf values (how much the ensemble can express). The bottleneck is expressiveness, not noise estimation.

Empirical result (250k×4ch, clustered_realistic):

Homoscedastic: 70.2% coverage at 90% nominal
Heteroscedastic: 71.8% coverage at 90% nominal (+2.2%)

GFR warm-start depth: 5 vs 10 sweeps — no difference¶

GFR grows trees greedily from root using marginal likelihood (He & Hahn 2021). At 5 sweeps, the GFR trees are already near their data-supported depth. But the MCMC prior (alpha_tau=0.75, beta_tau=3.0) then prunes them back toward stumps. More GFR sweeps just give MCMC more to prune.

The GFR→MCMC depth mismatch is confirmed by the structural prior analysis: grow proposals at depth 1 face a prior penalty of exp(-2.32) ≈ 0.098 — an order of magnitude penalty before the likelihood ratio. The MCMC equilibrium overwhelmingly favors stumps regardless of initialization.

He & Hahn’s own Table 4 shows warm-start improves coverage from 0.73-0.84 to 0.87-0.99 for the mean function. But this improvement is already baked into our workflow. The remaining gap is the structural bias.

Semiparametric correction (Yiu et al.): incomplete for hurdle RPV¶

The theory is elegant: one-step EIF correction + Bayesian bootstrap variance inflation. Prototyped in four variants (compute_channel_correction, formerly pytyche.bcf.hurdle.correction).

Status (2026-06): removed. The prototype never delivered calibrated composed-RPV intervals and had no live consumers — the earlier sequential-targeting loop (since removed) and the calibration pipeline read result.rpv_cate_samples directly. The compute_channel_correction symbol was removed from the public API in the multi-arm-joint-hurdle-bcf change rather than extended to K arms.

Caveat: the per-channel analysis in the remainder of this section is historical and not current class-A understanding — the severity-vs-binary calibration framing below in particular does not hold up. Treat it as context for why the correction was attempted, not as a current diagnostic. A broader refresh of this doc is owed separately.

The critical diagnostic (per-channel coverage at 90% nominal):

Severity: already well-calibrated — sev0=0.898, sev1=0.934. Not the bottleneck.
Binary conversion: where coverage degrades — same tree-structure-prior concentration mechanism as the continuous case.

This is counterintuitive. Severity (lognormal, heavy-tailed, converters only) is fine. Binary conversion (bounded [0,1], all visitors, simpler model) is the problem. The explanation:

The severity channel gets accidental calibration from multiple uncertainty sources:

τ₀ sampled from Gamma posterior → error precision uncertainty
Per-leaf γ sampled → heteroscedastic observation weighting
Only ~30% of visitors contribute → smaller effective n per leaf
These widen the posterior enough to maintain ~0.90 coverage

The binary channel is too clean:

σ² = 1 FIXED (probit convention) → no noise variance uncertainty
ALL n visitors contribute → larger n per leaf → fastest concentration
No τ_hat/gamma mechanism → no additional variance inflation
Posterior concentrates at rate 1/(n_leaf · 0.25 + 50)

Four decomposition strategies tested:

Method	Approach	Result
Naive RPV	Direct IPW on composed p·m	IPW noise from zero-inflation overwhelms (~39× CATE)
Ratio EIF	Binary + severity separately, compose	Mechanically works per channel
Log-space	Converter-restricted, exp back-transform	Jensen’s inequality on back-transform
gate_dp	Binary correction only, plug-in severity	Most promising — fixes the right channel

gate_dp is the right idea — it corrects the binary channel (where the miscalibration lives) and leaves severity alone (already calibrated). But the composed RPV GATE = E[p₁·m₁ - p₀·m₀] isn’t sufficiently improved.

Open question: why doesn’t binary-channel correction translate to calibrated RPV? The product composition p·m is nonlinear — residual binary bias gets amplified through multiplication by severity. The per-channel correction doesn’t account for cross-channel coupling in the product.

Why IPW noise blows up under zero-inflation (theoretical, needs empirical re-confirmation)¶

Empirical anchors:

DR pseudo-outcome variance ~39× CATE for zero-inflated revenue, ~1.4× for continuous (conformal_calibration.py spike).
Naive RPV decomposition’s scale factors 20–60×, leading to 100% overcoverage.
Yiu et al. semiparametric correction tried in 4 variants, all failed to calibrate RPV.

Conjectured mechanism — the explanation below is a reasoned account of why we observed the empirical findings; the variance-decomposition steps have not been independently re-derived against fresh data and should be verified before being relied on. The argument:

IPW weights blow up as π → 0 or 1. The efficient influence function has the form φ_i = (R_i/π - (1-R_i)/(1-π))·(Y_i - μ̂(x_i, R_i)) + τ̂(x_i). For non-converters (R=0), the weight is 1/(1-π) ≈ 1.1 at π=0.1 conversion — fine. For converters (R=1), the weight is 1/π ≈ 10× at π=0.1. So converters’ residuals are amplified by 10× in the EIF.
Revenue is heavy-tailed. Log-normal severity (or our approximate log-normal generators) puts substantial mass in the tail. The residual Y_i - μ̂(x_i, R_i) is itself large for converters, especially in regions where the mean estimate is biased.
The product is multiplicative: 10× weight amplification × 5–10× residual amplitude gives ~50–100× contribution to the EIF variance for converted observations. For continuous outcomes (no zero-inflation, R≡1), this multiplier doesn’t compound — π is constant, residuals are bounded by Gaussian assumption — so the variance ratio stays near 1.
The Bayesian-bootstrap variance inflation step in Yiu’s correction uses this EIF. When the EIF variance is itself unstable, the bootstrap inflation adds noise faster than it adds calibration — wider intervals that don’t track truth.

Implications, also conjectural:

Any post-hoc estimator that consumes IPW or DR pseudo-outcomes likely has the same failure mode for zero-inflated semicontinuous outcomes: conformal meta-learners (Alaa et al. 2023), augmented IPW, doubly-robust ATE/CATE estimators. Our experiments only confirmed this for Yiu and Alaa; we extrapolate to the class.
Calibration corrections must operate on regression-based residuals, not IPW residuals. Candidates: causal isotonic calibration (van der Laan et al. 2023, regression-based), GATE corrections that don’t reweight by propensity, or doing the inference better in the first place (BCF prior structure, more tau trees).
Channel-separated correction (gate_dp pattern) is structurally promising but the cross-channel coupling at composition is unaccounted for. A correction that explicitly handled the product covariance might work — has not been tried.

What needs empirical re-confirmation before this section can be cited:

Replicate the 39× / 1.4× variance ratio with fresh conformal_calibration.py run on the current hurdle DGPs.
Profile the EIF variance contribution per observation (converter vs non-converter, by propensity bin) — the 1/π amplification claim is mechanical but our actual implementation may use a different EIF parameterization.
Test whether a regression-based calibration (e.g., causal isotonic) avoids the failure mode we attribute to IPW.
Confirm that other DR estimators (not Yiu specifically) fail similarly on the same DGPs — the “kills any IPW/DR method” generalization is an extrapolation from our N=2 (Yiu, Alaa) sample.

The summary lesson — “IPW residuals are too noisy in the zero-inflated hurdle regime to drive calibration corrections” — is empirically well- supported. The mechanistic story above is consistent with that observation but is not independently verified.

More MCMC iterations: better estimates of the wrong thing¶

Our MCMC shows R-hat ~1.0 and ESS 150-200+ per chain. We’re well-mixed. The coverage gap (~20-30 points at large n) is far too large to be MCMC sampling error. Longer chains give more precise estimates of the same biased posterior.

The Lever Analysis¶

The full landscape¶

Lever	Reduces bias?	Widens CIs?	Cost	Impact
More tau trees (50→200)	Yes (T³ scaling)	Yes (more variance)	4× tau VRAM, ~1.6× time	High
Deeper tau trees (α/β sweep)	Slightly (~30% more leaves)	Slightly	Free (config change)	Low-moderate
Looser leaf prior	No	Yes (wider posteriors)	Free	Symptomatic only
More MCMC samples	No	No (same model)	Linear time	None
Scale-blind R(p)	No	Yes (wrong magnitude)	Sweep data	Harmful
Scale-conditioned R(p)	No	Yes (per scale band)	Large sweep data	Moderate
gate_dp (binary EIF)	Partially (fixes binary channel)	Yes (BB inflation)	O(S×n) post-processing	Moderate

Tier 1: more tau trees (50→200)¶

Mechanism: The T³ scaling law means 4× more trees extends n_critical by 64×. At T=200, coverage should hold to ~17M (L=1) or ~1.9M (L=3).

VRAM cost: Additional tau leaf_indices: 150 trees × n × 4 bytes × 4 chains. At n=500k: 150 × 500k × 4 × 4 = 1.2 GB. Fits on a 3090 (current model uses ~15-16 GB for 500k×4ch).

Time cost: MCMC sweeps through 200 tau trees instead of 50 — each sweep step visits every tree. Total trees goes from 250 (200μ + 50τ) to 400 (200μ + 200τ), a ~1.6× slowdown per MCMC iteration.

Theoretical grounding: The coverage formula T_min = (d_τ·σ_τ·√n/3.29)^{2/3} predicts T≥73 for “Typical” complexity at 500k, and T≥133 for “Complex” at 1M. T=200 provides headroom above both thresholds. See the reference table above.

Interaction with regularization: More trees means each tree contributes less (leaf prior variance = 1/T). The ensemble is more heavily regularized per tree but more expressive overall. This is the right direction — many weakly-contributing trees give finer additive resolution while maintaining shrinkage.

Tier 2: tau tree depth sweep (in progress)¶

Three RunPod 3090 pods testing alpha_tau/beta_tau at (0.75/3.0, 0.85/2.5, 0.95/2.0) across 500k-1.25M.

Expected effect: the “deep” setting (0.95/2.0) gives ~125 tau leaves vs ~95 current. This is a 30% increase — modest compared to the T³ scaling from more trees (4× more trees = 64× more headroom).

The depth sweep tests the hypothesis that tree depth matters more than tree count. The theory says it doesn’t — the T³ scaling comes from the number of distinct step functions in the additive basis, and a stump with 2 levels contributes as much resolution as a deeper tree’s 4 levels divided across interactions.

Not recommended: loosening the leaf prior¶

Reducing leaf_prior_cov_inv from T*4 to T*1 would widen posteriors by ~2×. This is symptomatic treatment — it doesn’t reduce bias, just makes the CIs wide enough to paper over it. The calibration would depend on the prior setting matching the bias magnitude, which varies by DGP and scale. Scale-blind isotonic R(p) already failed for the same reason.

The Welfare Question¶

Where miscalibrated CIs hurt¶

In the sequential targeting loop, CIs serve three functions:

1. Allocation direction — which segments get treated

P(τ>0) drives allocation via segment-mean posterior (_fit_policy_tree:547):

seg_cate_means = rpv_cate_samples[mask].mean(axis=0)  # (S,)
p_benefit = float((seg_cate_means > 0).mean())

Overconfident posteriors push P(τ>0) toward 0 or 1, making allocation more decisive. This is actually helpful when the posterior mean has the correct sign (96-98% of the time). Only the 2-4% overconfident-wrong cases hurt.

2. Stopping rules — when to declare convergence

CIs excluding zero (confidence in effect direction) drive convergence detection. Overconfident posteriors declare convergence early. When the posterior mean is correct, this is fine — just aggressive. When wrong, it locks in a bad decision.

3. Thompson exploration — how much to explore uncertain segments

Overconfident posteriors mean Thompson draws are too concentrated, leading to under-exploration of segments with genuine uncertainty. This matters most at medium n (50k-250k) where some segments are still uncertain.

Quantitative welfare estimate¶

Per-segment misallocation probability: ~3% (from empirical overconfident-wrong rate). With 5 segments:

P(≥1 segment misallocated) ≈ 1 - 0.97⁵ ≈ 14% per round
Expected misallocated segments: 0.15 per round
Each misallocated segment costs ~20% of oracle lift for ~20% of visitors
Direct allocation welfare loss: ~0.6% of RPV per round

Premature stopping cost:

System converges 1-2 rounds early vs. well-calibrated CIs
Late-round improvement is diminishing (~5-10% of remaining oracle gap)
Stopping welfare loss: ~5-10% of remaining oracle gap

Total welfare loss: 1-3% of oracle lift. Not catastrophic, but meaningful for a system designed to compound value over rounds.

The highest-risk window: 50k-250k¶

This is where calibration degradation coincides with still-uncertain point estimates:

Decision accuracy: 0.78-0.83 (not yet reliable)
Coverage: 0.78-0.83 (degrading)
This is when the system makes its most impactful allocation decisions (segments are being discovered, not yet confirmed)

Below 50k: CIs are wide enough, coverage is good. Above 250k: point estimates are reliable enough that overconfident CIs don’t change decisions.

Choosing num_trees_tau: Theory and Practice¶

Tree count is a lower bound, not the fix

This section’s T³ formula is derived from the superseded bias diagnosis and empirically overestimates the coverage that more trees deliver on realistic DGPs (see the 2026-03-27 sweep). Read it as a minimum — too few tau trees is its own problem and the formula tells you the floor — but not as a route to nominal coverage at large n. Closing the remaining gap is the job of better topology mixing and post-hoc recalibration, above.

The coverage formula for the BCF tau forest¶

The BCF tau forest’s calibration depends on three quantities:

n — total sample size (cumulative across sequential targeting rounds)
d_τ — effective dimensionality of the CATE surface (how many covariates actually moderate the treatment effect — typically much smaller than total feature count)
σ_τ — standard deviation of the true CATE across the covariate distribution, on the tau forest’s working scale (probit for binary, standardized for continuous)

From the n_critical derivation (prior section), the bias of a T-stump tau forest approximation to a Lipschitz CATE surface is:

Bias² ≈ d_τ² · σ_τ² / T²

And the tau forest’s per-visitor posterior variance (probit channel, σ²=1, h²=0.25 from adaptive basis coding):

Var_post ≈ 4T/n

The coverage condition for a (1-α) CI requires the z-score to stay within the critical value:

|Bias| / SD_post ≤ z_{1-α/2}
d_τ · σ_τ / T ≤ z_{1-α/2} · 2√(T/n)

Solving for T:

T_min = ⌈(d_τ · σ_τ · √n / (2 · z_{1-α/2}))^{2/3}⌉

For 90% CIs (z=1.645): T_min = ⌈(d_τ · σ_τ · √n / 3.29)^{2/3}⌉

Why this is BCF-specific, not generic BART¶

Several BCF properties are baked into this formula:

Basis coding h²=0.25 enters the posterior variance denominator. Standard BART has h=1, giving Var ≈ T/n. The ×4 factor from BCF’s adaptive coding means the posterior concentrates 4× faster, making the calibration constraint binding at 4× smaller n than standard BART.
The binding channel is binary (probit) with σ²=1 fixed. The severity channel gets “accidental calibration” from τ₀ sampling and per-leaf γ, so it’s not the constraint. The formula uses the probit channel’s mechanics.
d_τ is the HTE dimensionality, not the total feature count. BCF’s mu forest absorbs prognostic variation. The tau forest only needs to capture treatment effect modifiers — typically 2-5 covariates even when the total feature space is much larger. The tau prior (α_tau, β_tau) and BART’s variable selection naturally discover d_τ.
σ_τ is the CATE heterogeneity, not the outcome variation. The mu forest absorbs most of Var(Y). The tau forest captures only the treatment effect surface. With BCF’s prior on leaf scale (Var = 1/(T×4)), the prior SD of the ensemble CATE is √(T × 1/(T×4)) = 0.5 on the standardized scale.

Reference table: T_min for 90% CI coverage¶

Rows = cumulative sample size. Columns = CATE complexity profiles.

n	Low (d=2, σ=0.3)	Moderate (d=3, σ=0.5)	Typical (d=4, σ=0.7)	Complex (d=5, σ=1.0)
30k	9	19	30	42
100k	13	28	43	62
250k	18	38	58	84
500k	22	47	73	105
1M	28	60	92	133
2M	36	75	116	167

Reading: at n=500k with “Typical” complexity (d_τ=4, σ_τ=0.7), you need T≥73 tau trees for 90% CI coverage. Our current T=50 falls short, matching the empirical ~0.73 coverage at that scale.

For 95% CIs (z=1.96), the wider interval tolerates more bias, so T_min is lower: T_min(95%) = T_min(90%) × (1.645/1.96)^{2/3} ≈ 0.89 × T_min(90%). Wider CIs are easier to calibrate.

How to estimate d_τ and σ_τ¶

d_τ (treatment effect dimensionality):

Estimate from the policy tree or variable importance after a preliminary fit:

Count the number of distinct variables that appear in the depth-3 policy tree — this is an upper bound on d_τ
Or check the tau forest’s variable selection frequency (how many covariates appear in >10% of tau trees)

E-commerce priors when no data is available:

Simple layout/UX tests: d_τ ≈ 1-2 (device type, maybe one demographic)
Product page experiments: d_τ ≈ 2-4 (device, engagement, category, price)
Personalization experiments: d_τ ≈ 4-6 (behavioral + demographic features)

σ_τ (CATE heterogeneity):

After a preliminary fit, read est_spread from compute_slim_diagnostics() — this is the std of posterior mean CATE across visitors, on the standardized scale.

To convert from natural units: for the probit channel at baseline conversion rate p₀, a conversion rate difference of Δp corresponds to probit-scale CATE ≈ Δp / φ(Φ⁻¹(p₀)) where φ is the standard normal density. At p₀=0.05, φ(Φ⁻¹(0.05)) ≈ 0.10, so ±3pp HTE → probit σ_τ ≈ 0.03/0.10 = 0.3.

E-commerce priors:

Homogeneous ATE (calibration unnecessary): σ_τ < 0.1
Mild heterogeneity: σ_τ ≈ 0.2-0.4 (different segments vary by ±50% of ATE)
Moderate heterogeneity: σ_τ ≈ 0.4-0.8 (some segments benefit, some neutral)
Strong reversal: σ_τ ≈ 0.8-1.5 (some segments harmed — the reason you’re running the experiment)

Default recommendation¶

For the sequential targeting use case with geometric batching:

Early rounds (n < 50k): T=50 is adequate. Coverage ≥ 0.85.
Mid rounds (50k-500k): T=100-150 recommended. The 50k-250k window is highest risk (CIs degrading, point estimates not yet reliable).
Late rounds (500k+): T=200+ needed for complex CATEs.

The practical default: T=200. This covers all scales up to ~1-2M for moderate complexity (d_τ ≤ 4, σ_τ ≤ 0.7), which encompasses most e-commerce experiments. The cost is ~1.6× MCMC time and ~1.2 GB additional VRAM at 500k×4ch.

An alternative: adapt T to cumulative n across sequential targeting rounds. Start with T=50 for round 1 (n=10k-30k, where T=50 is fine), then increase to T=100 at round 3 (n ~60-100k), and T=200 at round 5+ (n ~250k+). This requires re-initializing the tau forest across rounds, which is natural since BCF is refit on all accumulated data each round anyway.

Depth correction factor¶

The formula assumes stumps (~2 leaves per tree). With deeper trees (e.g., α_tau=0.95, β_tau=2.0 giving ~2.5 leaves per tree), each tree contributes more partition resolution. The effective T is:

T_eff = T × avg_leaves / 2

With the current prior (α=0.75, β=3.0): avg_leaves ≈ 1.9, T_eff ≈ 0.95T ≈ T. With the “deep” setting (α=0.95, β=2.0): avg_leaves ≈ 2.5, T_eff ≈ 1.25T.

The depth correction is modest. Doubling T (50→100) gives 2.83× T_eff gain. The deepest prior setting gives 1.25× T_eff gain. More trees dominates.

VRAM budget¶

The dominant cost of tau trees is the leaf_indices array: (T_tau, n) int32 per chain.

T_tau	VRAM at 500k×4ch	VRAM at 1M×4ch	Fits 3090 (24GB)?
50 (current)	0.4 GB	0.8 GB	Yes
100	0.8 GB	1.6 GB	Yes
200	1.6 GB	3.2 GB	Yes (tight at 1M)
400	3.2 GB	6.4 GB	Needs Pro 6000 at 1M

These are incremental costs on top of the mu forest (200 trees, ~1.6 GB at 500k×4ch). The total BCF model at 500k×4ch with T_tau=200 uses ~18 GB, fitting on a 3090 with ~6 GB headroom.

Validation: checking the formula against sweep data¶

The formula predicts T_min for the sweep data (α_tau=0.75, T=50):

Sweep scenario	Est. d_τ	Est. σ_τ	T_min at n=500k	Actual coverage
constant (no HTE)	~0	~0	<10	~1.0 (overcalibrated)
monotone_gradient	~2	~0.3	~22	~0.88-0.95
clustered_realistic	~4	~0.7	~73	~0.75-0.84
clustered_complex	~5	~1.0	~105	~0.69-0.78

The ordering matches: coverage degrades as T_min exceeds 50. The constant template (σ_τ≈0) has T_min near zero and is overcalibrated (prior wider than needed). The complex clustered template has T_min well above 50 and shows the worst coverage.

Recommended Path Forward¶

Ordered by the corrected diagnosis: attack the under-dispersion at the source where the sampler can reach it, correct what it cannot, and keep enough tau trees that tree count is never the binding constraint.

Better topology mixing — perturb / rotation moves (the source-level fix)¶

Pratola-style perturb and rotation MCMC moves explore tree structure more effectively than grow/prune alone, restoring partition variance inside the sampler. The n=5000 result (cov90 0.640 → 0.884) is the encouraging signal; the large-n (250k–750k) perturb sweep is the deciding experiment and is not yet complete — treat large-n viability as preliminary. The cost is fit throughput (the move is more expensive per sweep; a performance refactor is active). This is the most principled fix because it makes the raw posterior honest rather than correcting it after the fact.

Post-hoc recalibration (correct what mixing can’t reach)¶

Where the sampler cannot traverse the partition posterior in feasible wall-clock, fit a layered correction on known-truth SBC sweeps (isotonic R(p) below the recorded ceiling, scale-family width inflation past it) and apply it at fit time. Build the sweep across multiple scales (10k, 50k, 250k, 500k, 1M) with enough fits per band for stable isotonic regression; calibration_sweep_v2.py supports --scales. This is opt-in and costs the sweep plus wider intervals downstream — see the speed/calibration tradeoff above. It is the principled correction for an intractable sampling problem, not a patch.

Keep tau trees above the floor (necessary, not sufficient)¶

Too few tau trees is its own miscalibration source; the num_trees_tau formula gives the minimum for a given scale and CATE complexity. Default num_trees_tau=200 clears the floor for most e-commerce scales (~1.6× MCMC time, ~1.2 GB extra VRAM at 500k×4ch). But — per the corrected diagnosis — clearing the floor is not the same as reaching nominal coverage; the 2026-03-27 sweep showed more trees plateauing well short at 500k+. Set tau trees high enough that they are not the binding constraint, then rely on mixing and recalibration for the rest.

Phase 3: revisit gate_dp (binary EIF correction)¶

The gate_dp method corrects the right channel (binary conversion, where coverage degrades) and leaves severity alone (already calibrated at ~0.90). The remaining question is why binary-channel correction doesn’t fully translate to calibrated RPV GATE. Two hypotheses:

The product composition p·m amplifies residual binary bias through multiplication by severity — the correction is per-channel but the estimand is a nonlinear function of both channels
The BB variance inflation on binary is correctly sized for the dp estimand but undersized for the composed RPV estimand

Investigating this requires: enable channel retention, run prototype_channel_correction.py with T=200 tau trees, compare binary dp coverage vs. composed RPV GATE coverage to see if more trees + gate_dp together close the gap.

What to skip¶

Adaptive prior scaling with n: Engineering effort for uncertain payoff; better addressed by actually increasing T
Heteroscedastic extensions: Already tested, +2.2% doesn’t justify complexity
More GFR sweeps: Already tested, no effect
RoBART double-debiasing: Faces the same severity channel issues as other semiparametric approaches for the composed RPV estimand

Open Questions¶

The depth sweep results: Will deeper tau trees (α=0.95/β=2.0) improve coverage enough to matter? Theory predicts modest improvement (~30% more leaves). Results expected from the 3 RunPod pods.
VRAM-compute tradeoff at T=200: Is 1.6× slowdown per MCMC iteration acceptable? Could be offset by fewer MCMC iterations (200→100) if the wider posterior from more trees improves mixing.
Why doesn’t gate_dp translate to RPV coverage? The binary channel correction should fix the dominant source of miscalibration. The composition p·m may create cross-channel coupling that per-channel correction doesn’t capture. This is the most important open research question.
Binary channel’s probit nonlinearity: On the latent probit scale, the tau forest bias is additive. But on the probability scale (through Φ), bias near the tails (p ≈ 0 or p ≈ 1) gets compressed while bias near p ≈ 0.5 is amplified. Does this interact with the T³ scaling?
Does severity’s “accidental calibration” hold at T=200? With more trees, both channels get finer partitions. If severity was calibrated partly from having extra variance sources (τ₀, γ), reducing bias via more trees might make severity CIs tighter without reducing them to the point of miscalibration — but worth monitoring.

Session 2026-03-27: Realistic DGPs, Depth Relaxation, and Root Cause Analysis¶

DGP fix: heterogeneous prognostic surfaces¶

The prior calibration sweeps used flat prognostic surfaces (p0 = constant, m0 = constant). All heterogeneity was loaded onto the treatment arm. This was structurally unrealistic — in real e-commerce, prognostic variation is 3-10× larger than treatment variation. The flat DGPs gave mu nothing to do, which likely inflated T_mu leakage effects and made coverage look better than it would be in practice.

Fix implemented: Per-cluster prognostic baselines (latent-type-driven) + within-cluster feature-driven gradients on p0 and m0. prognostic_scale sampled per-DGP from U(0.3, 1.0). max_aov_ratio clamps m0 to prevent exp() blowup in log-space severity signals. Features are noisy windows into latent types — BCF only sees features, not cluster_id.

T_mu capacity: not the bottleneck¶

Sweep: 5 DGPs × {T_mu=100, 200} × {T_tau=50, 200} × {500k, 750k}. Mean Δcov for T_mu 100→200: +0.003 (noise, not signal). The mu forest at 100 trees is sufficient for the prognostic surfaces we generate.

T_tau=200: helps but doesn’t solve¶

Scale	T_tau=50	T_tau=200	Δcov
100k	0.734	0.890	+0.155
250k	0.593	0.718	+0.125
500k	0.614	0.698	+0.084

T=200 nearly reaches nominal at 100k but coverage degrades with scale. The T³ formula predicted T=200 sufficient to ~2M — it overestimates coverage on realistic DGPs.

Depth relaxation: meaningful improvement¶

alpha_tau=0.95/beta_tau=2.0 vs default 0.75/3.0, at T_tau=200:

DGP	Default 250k	Relaxed 250k	Δ
85388 (3cl)	0.692	0.833	+0.141
49553 (5cl)	0.702	0.797	+0.096

1.25M head-to-head (Pro 6000, 5 DGPs):

Default mean: 0.606
Depth relaxed mean: 0.665 (+0.060)
Positive for all 5 DGPs

Depth relaxation allows trees to capture interaction effects that stumps miss. Longer chains (200+400 MCMC) gave no improvement (+0.02/-0.02), confirming the issue is structural (tree expressiveness) not convergence.

Fit diagnostics: σ² is mixed, but tree-structure mixing was not measured¶

σ² split-half ratio = 1.000 across all fits, so the noise variance has converged. This was originally interpreted as “chains are mixed” full-stop — but σ² split-half ratios do not measure tree-structure mixing, which is the relevant diagnostic for the perturb hypothesis (Petrillo 2024: bartz’s MCMC “moves the tree structures slowly, and in practice in most problems it will not reach convergence”). Reframed (2026-05-17): σ² is converged; whether tree structure has converged is untested by these diagnostics and is what the Stage B large-n perturb experiment will settle. Under either hypothesis the symptom is the same — the converged-on-its-own-terms posterior is too narrow:

T_tau	100k w/sp	250k w/sp	500k w/sp
50	0.34	0.24	0.15
200	0.58	0.32	0.22

Width-to-spread ratio of 0.22 at T=200/500k means CIs are 22% of the true CATE spread. Quintile analysis shows the undercoverage is non-uniform: visitors near the CATE sign boundary (small |τ|) are worst, while high-CATE visitors (q4-q5) have better coverage.

Root cause: Donsker condition failure (literature support — but see 2026-05-17 reframe)¶

Literature review identified Donsker failure as the candidate mechanism (and the inferred mechanism at the time this section was written). Note (added 2026-05-17): the perturb n=5000 result (cov90 0.640 → 0.884 from a sampling-level fix), the diminishing returns from increasing num_trees, and Petrillo’s bartz-author acknowledgment of poor tree-structure mixing have shifted the framing — tree-structure mixing failure is now a competing root-cause hypothesis on at least equal footing. The literature support below remains relevant under the Donsker hypothesis; it does not refute the mixing hypothesis.

Key sources:

Breunig & Yu (2025, RoBART): BART’s piecewise-constant function class fails the Donsker condition (requires smoothness α > dimension/2). The posterior for treatment effect functionals has a non-vanishing bias term b₀ that persists with perfect mixing and infinite MCMC. This is a property of the function class, not the number of trees. More trees reduce per-tree error but don’t change the class.

Deshpande et al. (2023, ACIC 2022 reanalysis): BCF coverage at large n: 33-82%, median 49%. At small n (practice-level): 65-99%, median 87.5%. Confirms coverage degrades with n — opposite of correctly specified models.

Giordano et al. (2025): Regularization bias “bleeds into” plug-in estimators. Good contraction rates ≠ good coverage. One-step EIF correction achieved nominal coverage across all ACIC 2016 DGPs, but only tested at small n on continuous outcomes.

Implication: “More trees” is the wrong fix. The fix is either: (a) change the function class to satisfy Donsker (smooth leaves), or (b) explicitly debias the posterior via influence function corrections.

Both are untested for hurdle RPV at large n.

Candidate approach: smooth leaves via random features¶

ridgeBART (Yee, Ghosh & Deshpande 2024) replaces constant leaf values with smooth functions: g(x) = Σ β_d · φ(ω_d · x + b_d). At D=1 this is one nonlinearity per leaf.

Analysis of BCF adaptation shows:

Conjugacy preserved in both channels (probit + log-severity)
Heteroscedastic Normal-Gamma severity unchanged
The effective_basis = basis * phi trick keeps existing sufficient stat code structurally identical
Computational overhead ~5%, VRAM negligible
~200-300 lines of changes in _hurdle_step_forest

Practical concern: Inner parameter (ω, b) sampling at large n. The ridgeBART paper proposes from the prior — acceptance rates will be near zero at n=500k+. Options: fixed random features (D=20-50, fully conjugate Gibbs, no inner sampling), HMC-within-Gibbs (JAX autodiff for gradients), or particle methods.

Next step: oracle bias test¶

Before implementing smooth leaves, test whether within-leaf approximation bias is the dominant mechanism. Run a BCF fit retaining per-tree leaf assignments, compute the oracle within-leaf bias using known ground truth, subtract from posterior draws, check if corrected coverage approaches nominal. If yes, the smooth-leaf path is justified. If not, the cause is elsewhere (σ² underestimation, prior structural constraints, hurdle composition effects).

Key References¶

Breunig, Liu & Yu (2025). RoBART: Robust Semiparametric Bayesian Inference. arXiv:2509.24634. [The theoretical explanation — Donsker failure, non-vanishing bias, double debiasing fix]
Rockova & van der Pas (2020). Posterior Concentration for Bayesian Regression Trees and Forests. Annals of Statistics 48(4). [T ~ √n for optimal contraction; contraction ≠ coverage]
Yiu, Fong, Holmes & Rousseau (2025). Semiparametric Posterior Corrections. JRSS-B 87(4):1025+. arXiv:2306.06059. [One-step EIF correction; ACIC 2016 validation]
Hahn, Murray & Carvalho (2020). Bayesian Regression Tree Models for Causal Inference. Bayesian Analysis 15(3). [BCF; tau prior regularization; no n-scaling guidance]
Kokandakar, Kang & Deshpande (2022). Bayesian Causal Forests and the 2022 ACIC Data Challenge. arXiv:2211.02020. [49-53% coverage at 90% nominal for subgroup effects]
He & Hahn (2021). Stochastic Tree Ensembles for Regularized Supervised Learning. arXiv:2002.03375. [XBART/GFR; warm-start improves coverage to 0.87-0.99]
Linero & Yang (2018). Bayesian Regression Tree Ensembles That Adapt to Smoothness and Sparsity. JRSS-B 80(5). [Soft BART; adaptation to α > 1]
Yee, Ghosh & Deshpande (2024). Scalable piecewise smoothing with BART. arXiv:2411.07984. [ridgeBART; smooth leaf functions; D=1 ridge per leaf; posterior contraction for anisotropic Hölder; achieves rates for α > 1]
Giordano, Stephenson, Broderick et al. (2025). Semiparametric Posterior Corrections. JRSS-B 87(4):1025+. [General framework for regularization bias in plug-in estimators; EIF correction per posterior draw; best ACIC 2016]
Kim & Rockova (2025). On Mixing Rates for Bayesian CART. Electronic Journal of Statistics 19(2). arXiv:2306.00126. [Polynomial mixing for standard grow/prune; exponential for deep isolated signals; Twiggy BART fix]
Deshpande et al. (2023). Bayesian Causal Forests & the 2022 ACIC Data Challenge. Observational Studies 9(3). arXiv:2211.02020. [Large-n BCF coverage: median 49% at 95% nominal; propensity model affects coverage]