Testing Philosophy

Why we test the way we do, and how AI agents change the calculus.


The Core Observation

LLM agents are an infinite queue of junior developers you summarily execute after their first PR. Tests are the specification of correctness the agent cannot argue with. A failing test is an objective fact. “This looks right to me” from a model is not.

This observation drives every decision below.


The Fox and the Henhouse

When an agent writes both the test and the production code, the test isn’t an objective fact — it’s a self-fulfilling prophecy. The agent can:

  • Write a trivial test that passes trivially

  • Write a test that matches its implementation rather than the spec

  • Modify a failing test instead of fixing the underlying issue

  • Narrow the test to avoid a hard edge case

Solution: separate the test-writing role from the implementation role.

Contract tests are developed interactively — human and agent collaborate in chat to write and review them, just like the spec itself. The agent proposes tests, the human reviews and refines. Once approved, those tests become the immutable contract for implementation. The implementer agent then makes them pass. The implementer can add new tests but cannot edit existing ones. The contract can only get stricter over time.

This mirrors how real teams work: spec and acceptance criteria are collaborative and reviewed, then handed off for implementation.


Test Tiering by Acceptance Gate

Tests are organized by when the feedback is needed, not just by type. Pytyche has expensive-to-run components (MCMC, GPU JAX, SBC sweeps); tiering keeps the inner loop fast while still gating real work behind real verification.

Tier

Speed Target

What Runs

Marker

Fast

~20s (parallel, CPU JAX)

Pure data transforms, contract tests, non-MCMC unit tests

(no marker — unmarked means cheap)

Standard

minutes

MCMC sampling, ML model fits, JAX CPU compile-and-fit

@pytest.mark.statistical

GPU

minutes

GPU-required fits, end-to-end BCF fingerprints

@pytest.mark.gpu

Calibration

hours

SBC sweeps, parameter recovery validation

@pytest.mark.calibration

Entry point: ./scripts/run-tests.sh [tier] (see CLAUDE.md for tier definitions). The agent runs Fast/Standard on the inner write-modify-test cycle. GPU and Calibration tiers gate at logical checkpoints (subagent return, MR, manual). The agent is never blocked waiting for slow tests during implementation.

The tier filter is marker-driven, so markers are the tier’s contract: an unmarked MCMC test silently lands in the fast tier and degrades it for everyone. Two mechanical guards back the convention (tests/conftest.py): integration implies statistical at collection time, and a duration tripwire fails any unmarked test exceeding PYTYCHE_TIER_TRIPWIRE_SECONDS (default 20s) in a single phase — setup included, which is what catches fit-running module-scope fixtures.

For per-test-cost profiling before running broader suites, see CLAUDE.md § “Running tests during development”.


Mocking: Only External Boundaries

Mock only what you don’t control. Everything else should be real.

External boundaries you don’t control:

  • Network (e.g., remote artefact stores, container registries)

  • Filesystem (unless testing pure transforms)

  • Time (clock-dependent logic in run monitor, sweep schedulers)

  • Randomness (when a test needs determinism beyond jax.random.PRNGKey)

  • GPU compute (when characterizing a fit at a contract level you mock the fit output, not the JAX call graph — see the four-layer BCF suite in CLAUDE-AGENT.md)

Everything else — your own modules, your own data structures, your own kernels — should be real in tests. This protects against the specific failure mode of AI-generated code: satisfying mock expectations while producing subtly wrong behavior.

Anti-pattern: testing mock interactions

# BAD — asserts on the mock, not the behavior
def test_calibration_recalibrates_quantiles():
    mock_iso = mock.Mock()
    mock_iso.predict.return_value = np.array([0.1, 0.5, 0.9])
    recalibrate(mock_iso, posterior_quantiles)
    mock_iso.predict.assert_called_once()
# GOOD — asserts on observable output
def test_isotonic_recalibration_maps_overconfident_quantiles_to_truth():
    # Posterior claims 90% coverage but only achieves 70% (overconfident).
    raw_quantiles = np.linspace(0.0, 1.0, 100)
    empirical_coverage = np.clip(raw_quantiles * 0.7, 0.0, 1.0)

    recalibrated = isotonic_recalibrate(raw_quantiles, empirical_coverage)

    # After recalibration, p=0.9 should map close to empirical 0.7.
    assert abs(recalibrated[90] - 0.63) < STATISTICAL

The first test passes if the mock is called correctly but says nothing about whether the calibration is correct. The second test is a failing test the agent can’t argue with.


Contract Tests and the Dispatch Flow

Contract tests are developed interactively, not auto-generated. The workflow:

  1. Human and agent collaborate on spec — via the OpenSpec change workflow

  2. Contract tests are written during spec development — agent proposes tests based on spec scenarios, human reviews and refines in chat

  3. Tests are confirmed to fail meaningfully — the test is watched failing to verify it tests the right thing

  4. Implementer is dispatched with the approved contract tests

  5. Implementer reads contract tests — understands the contract

  6. Implementer writes production code — satisfies the contract

  7. Implementer may write new tests — for internal helpers, discovered edge cases

  8. Implementer CANNOT edit existing tests — the contract is immutable during implementation

  9. Implementer reports back — pass/fail + any ambiguity surfaced via the ---QUESTION--- sentinel block (see CLAUDE-AGENT.md)

The contract tests are the definition of “done.” They’re written collaboratively with human review, and cannot be weakened during implementation.


OpenSpec Scenario to Test Mapping

Each scenario in an OpenSpec spec should map to at least one test. But scenarios describe behavior, not implementation. The test should assert on the behavior described in the scenario, not on the internal mechanism.

Spec scenario:
  "Given a hurdle BCF fit with true PEHE 0.15,
   when computing the 90% credible interval for tau,
   then the empirical coverage over 100 SBC seeds is within 0.05 of 0.90"

Test:
  def test_hurdle_bcf_90pct_ci_achieves_target_coverage():
      coverages = run_sbc_sweep(n_seeds=100, target_alpha=0.9, generator=hurdle_dgp)
      assert abs(coverages.mean() - 0.9) < 0.05  # SBC band

The test reads like the scenario. No implementation details leak in.


Statistical Testing (pytyche)

Statistical code requires a different testing approach. The tolerance IS the design contract:

Level

Tolerance

Use Case

EXACT

1e-10

Deterministic transforms, known inputs

TIGHT

1e-7

Numerical precision, floating point

NUMERICAL

1e-4

Solver outputs, convergence checks

STATISTICAL

1e-2

Parameter recovery from synthetic data

RECOVERY

0.1

MCMC posterior recovery, real data fits

BUSINESS

Context-dependent

Decision-relevant effect sizes

The assertion abs(posterior_mean - true_value) < RECOVERY means “the model must recover the true parameter within 0.1.” This is a testable contract, not a vague aspiration.


Named Patterns

Patterns that earned their place across the multi-arm and L2 changes. Use them by name in test designs.

Identity-pin for thin wrappers

When a public symbol is a re-export or a delegate to the same function object, equality of outputs is true by construction — output-equality suites test the test. Pin the delegation by identity (pt.fit_policy_tree is pytyche.analysis.fit_policy_tree) plus ONE behavioral smoke through the public name. Five output-equality suites for five re-exports is the anti-pattern.

Bridge tests transfer fingerprints

Expensive numerical fingerprints (GPU MCMC baselines) stay pinned on the private raw-array cores. A cheap bridge test pins the public wrapper bit-identical to its core on the same inputs. The composition transfers the baseline guarantee to the public surface without re-recording baselines every time the wrapper plumbing changes — and pure name-surface or wiring changes provably can’t touch numerics.

GPU gate cadence

GPU fingerprints re-run at numerics-adjacent section boundaries only — never per batch, never to confirm a rename, never concurrent with any other JAX process. The task tick states which it was (“not numerics-adjacent: no GPU re-run”).

Content-not-format for presentation surfaces

Reprs, prose summaries, and printed output are presentation, not contract. Tests assert key CONTENT is present (the decision, the numbers, the treatment name) and the rough shape (multi-line vs row) — never exact formatting. The format can be polished without breaking tests; a NaN-bearing input must repr without raising.

Parametrize over modes, don’t duplicate suites

When one surface has modes (pooling="joint" | "independent"), the behavioral suite parametrizes over the mode; mode-specific behavior gets targeted tests. Copy-pasted per-mode suites drift apart and double the maintenance surface.


Tooling

pytest JSON output

For programmatic consumption (agent-driven verification, CI gates), run pytest with --json-report to produce structured output rather than parsing human-readable text. Combine with --durations=10 to surface the slowest tests when triaging a suite that’s grown expensive.

LSP diagnostics

pyright LSP is active. Type errors surface on edit, before the agent even runs tests. This catches an entire class of bugs (type mismatches, wrong signatures, missing imports) at zero cost to the feedback loop. Treat a pyright error as a failing test — don’t ignore it, don’t add # type: ignore to dodge it. If types don’t fit, the design needs to change.


What This Document Informs

Artifact

What it takes from here

CLAUDE-AGENT.md

TDD methodology section, tolerance hierarchy reference

OpenSpec apply workflow

Contract-test dispatch flow, immutable-contract rule

Reviewer agent definition

Test quality checklist, anti-pattern detection


Living document. Updated as testing practices evolve.