Testing Philosophy¶

Why we test the way we do, and how AI agents change the calculus.

The Core Observation¶

LLM agents are an infinite queue of junior developers you summarily execute after their first PR. Tests are the specification of correctness the agent cannot argue with. A failing test is an objective fact. “This looks right to me” from a model is not.

This observation drives every decision below.

The Fox and the Henhouse¶

When an agent writes both the test and the production code, the test isn’t an objective fact — it’s a self-fulfilling prophecy. The agent can:

Write a trivial test that passes trivially
Write a test that matches its implementation rather than the spec
Modify a failing test instead of fixing the underlying issue
Narrow the test to avoid a hard edge case

Solution: separate the test-writing role from the implementation role.

Contract tests are developed interactively — human and agent collaborate in chat to write and review them, just like the spec itself. The agent proposes tests, the human reviews and refines. Once approved, those tests become the immutable contract for implementation. The implementer agent then makes them pass. The implementer can add new tests but cannot edit existing ones. The contract can only get stricter over time.

This mirrors how real teams work: spec and acceptance criteria are collaborative and reviewed, then handed off for implementation.

Test Tiering by Acceptance Gate¶

Tests are organized by when the feedback is needed, not just by type. Pytyche has expensive-to-run components (MCMC, GPU JAX, SBC sweeps); tiering keeps the inner loop fast while still gating real work behind real verification.

Tier	Speed Target	What Runs	Marker
Fast	~20s (parallel, CPU JAX)	Pure data transforms, contract tests, non-MCMC unit tests	(no marker — unmarked means cheap)
Standard	minutes	MCMC sampling, ML model fits, JAX CPU compile-and-fit	`@pytest.mark.statistical`
GPU	minutes	GPU-required fits, end-to-end BCF fingerprints	`@pytest.mark.gpu`
Calibration	hours	SBC sweeps, parameter recovery validation	`@pytest.mark.calibration`

Entry point: ./scripts/run-tests.sh [tier] (see CLAUDE.md for tier definitions). The agent runs Fast/Standard on the inner write-modify-test cycle. GPU and Calibration tiers gate at logical checkpoints (subagent return, MR, manual). The agent is never blocked waiting for slow tests during implementation.

The tier filter is marker-driven, so markers are the tier’s contract: an unmarked MCMC test silently lands in the fast tier and degrades it for everyone. Two mechanical guards back the convention (tests/conftest.py): integration implies statistical at collection time, and a duration tripwire fails any unmarked test exceeding PYTYCHE_TIER_TRIPWIRE_SECONDS (default 20s) in a single phase — setup included, which is what catches fit-running module-scope fixtures.

For per-test-cost profiling before running broader suites, see CLAUDE.md § “Running tests during development”.

Mocking: Only External Boundaries¶

Mock only what you don’t control. Everything else should be real.

External boundaries you don’t control:

Network (e.g., remote artefact stores, container registries)
Filesystem (unless testing pure transforms)
Time (clock-dependent logic in run monitor, sweep schedulers)
Randomness (when a test needs determinism beyond jax.random.PRNGKey)
GPU compute (when characterizing a fit at a contract level you mock the fit output, not the JAX call graph — see the four-layer BCF suite in CLAUDE-AGENT.md)

Everything else — your own modules, your own data structures, your own kernels — should be real in tests. This protects against the specific failure mode of AI-generated code: satisfying mock expectations while producing subtly wrong behavior.

Anti-pattern: testing mock interactions¶

# BAD — asserts on the mock, not the behavior
def test_calibration_recalibrates_quantiles():
    mock_iso = mock.Mock()
    mock_iso.predict.return_value = np.array([0.1, 0.5, 0.9])
    recalibrate(mock_iso, posterior_quantiles)
    mock_iso.predict.assert_called_once()

# GOOD — asserts on observable output
def test_isotonic_recalibration_maps_overconfident_quantiles_to_truth():
    # Posterior claims 90% coverage but only achieves 70% (overconfident).
    raw_quantiles = np.linspace(0.0, 1.0, 100)
    empirical_coverage = np.clip(raw_quantiles * 0.7, 0.0, 1.0)

    recalibrated = isotonic_recalibrate(raw_quantiles, empirical_coverage)

    # After recalibration, p=0.9 should map close to empirical 0.7.
    assert abs(recalibrated[90] - 0.63) < STATISTICAL

The first test passes if the mock is called correctly but says nothing about whether the calibration is correct. The second test is a failing test the agent can’t argue with.

Contract Tests and the Dispatch Flow¶

Contract tests are developed interactively, not auto-generated. The workflow:

Human and agent collaborate on spec — via the OpenSpec change workflow
Contract tests are written during spec development — agent proposes tests based on spec scenarios, human reviews and refines in chat
Tests are confirmed to fail meaningfully — the test is watched failing to verify it tests the right thing
Implementer is dispatched with the approved contract tests
Implementer reads contract tests — understands the contract
Implementer writes production code — satisfies the contract
Implementer may write new tests — for internal helpers, discovered edge cases
Implementer CANNOT edit existing tests — the contract is immutable during implementation
Implementer reports back — pass/fail + any ambiguity surfaced via the ---QUESTION--- sentinel block (see CLAUDE-AGENT.md)

The contract tests are the definition of “done.” They’re written collaboratively with human review, and cannot be weakened during implementation.

OpenSpec Scenario to Test Mapping¶

Each scenario in an OpenSpec spec should map to at least one test. But scenarios describe behavior, not implementation. The test should assert on the behavior described in the scenario, not on the internal mechanism.

Spec scenario:
  "Given a hurdle BCF fit with true PEHE 0.15,
   when computing the 90% credible interval for tau,
   then the empirical coverage over 100 SBC seeds is within 0.05 of 0.90"

Test:
  def test_hurdle_bcf_90pct_ci_achieves_target_coverage():
      coverages = run_sbc_sweep(n_seeds=100, target_alpha=0.9, generator=hurdle_dgp)
      assert abs(coverages.mean() - 0.9) < 0.05  # SBC band

The test reads like the scenario. No implementation details leak in.

Statistical Testing (pytyche)¶

Statistical code requires a different testing approach. The tolerance IS the design contract:

Level	Tolerance	Use Case
EXACT	1e-10	Deterministic transforms, known inputs
TIGHT	1e-7	Numerical precision, floating point
NUMERICAL	1e-4	Solver outputs, convergence checks
STATISTICAL	1e-2	Parameter recovery from synthetic data
RECOVERY	0.1	MCMC posterior recovery, real data fits
BUSINESS	Context-dependent	Decision-relevant effect sizes

The assertion abs(posterior_mean - true_value) < RECOVERY means “the model must recover the true parameter within 0.1.” This is a testable contract, not a vague aspiration.

Named Patterns¶

Patterns that earned their place across the multi-arm and L2 changes. Use them by name in test designs.

Identity-pin for thin wrappers¶

When a public symbol is a re-export or a delegate to the same function object, equality of outputs is true by construction — output-equality suites test the test. Pin the delegation by identity (pt.fit_policy_tree is pytyche.analysis.fit_policy_tree) plus ONE behavioral smoke through the public name. Five output-equality suites for five re-exports is the anti-pattern.

Bridge tests transfer fingerprints¶

Expensive numerical fingerprints (GPU MCMC baselines) stay pinned on the private raw-array cores. A cheap bridge test pins the public wrapper bit-identical to its core on the same inputs. The composition transfers the baseline guarantee to the public surface without re-recording baselines every time the wrapper plumbing changes — and pure name-surface or wiring changes provably can’t touch numerics.

GPU gate cadence¶

GPU fingerprints re-run at numerics-adjacent section boundaries only — never per batch, never to confirm a rename, never concurrent with any other JAX process. The task tick states which it was (“not numerics-adjacent: no GPU re-run”).

Content-not-format for presentation surfaces¶

Reprs, prose summaries, and printed output are presentation, not contract. Tests assert key CONTENT is present (the decision, the numbers, the treatment name) and the rough shape (multi-line vs row) — never exact formatting. The format can be polished without breaking tests; a NaN-bearing input must repr without raising.

Parametrize over modes, don’t duplicate suites¶

When one surface has modes (pooling="joint" | "independent"), the behavioral suite parametrizes over the mode; mode-specific behavior gets targeted tests. Copy-pasted per-mode suites drift apart and double the maintenance surface.

Tooling¶

pytest JSON output¶

For programmatic consumption (agent-driven verification, CI gates), run pytest with --json-report to produce structured output rather than parsing human-readable text. Combine with --durations=10 to surface the slowest tests when triaging a suite that’s grown expensive.

LSP diagnostics¶

pyright LSP is active. Type errors surface on edit, before the agent even runs tests. This catches an entire class of bugs (type mismatches, wrong signatures, missing imports) at zero cost to the feedback loop. Treat a pyright error as a failing test — don’t ignore it, don’t add # type: ignore to dodge it. If types don’t fit, the design needs to change.

What This Document Informs¶

Artifact	What it takes from here
`CLAUDE-AGENT.md`	TDD methodology section, tolerance hierarchy reference
OpenSpec apply workflow	Contract-test dispatch flow, immutable-contract rule
Reviewer agent definition	Test quality checklist, anti-pattern detection

Living document. Updated as testing practices evolve.