Experiment Tracking

How pytyche’s calibration SBC sweeps and their downstream artifacts are versioned, logged, and reproduced: what each piece is for, how the pieces fit together, and the canonical invocations.

Note

This page is for developers running calibration sweeps from a source checkout of the repository — the pipeline files it describes (dvc.yaml, params.yaml, scripts/) are not part of the installed package.


Pieces

| Artifact | Where | Owns |
| --- | --- | --- |
| dvc.yaml | repo root | The pipeline DAG: train_sweep → {fit_corrections, test_sweep} → evaluate_layered. |
| params.yaml | repo root | Sweep IDs + per-stage CLI arguments. Edit before each pilot run. |
| manifest.json | each sweep dir | Per-sweep metadata: experiment_id, git sha + dirty flag, env (python/platform), params, data_provenance, and the pytyche.calibration extension content. Schema at docs/specs/experiment-manifest-schema.json. |
| dvclive/ | each sweep dir | Per-fit metrics (coverage_<level>, rmse, bias, wall_seconds) and params (config_seed, n_visitors, generator_family). Inspectable as TSV via cat sweep_dir/dvclive/metrics.tsv. |
| summary.json | per-config dir | Existing per-fit summary (preserved unchanged from pre-tracking sweeps; consumed by scripts/fit_sbc_correction.py). |
| aggregate.csv | each sweep dir | Existing per-sweep aggregate (preserved unchanged). |
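
Because the dvclive metrics are plain TSV, a per-sweep summary needs nothing beyond the standard library. The sketch below assumes metrics.tsv holds one row per fit with the columns listed above. The column name coverage_0.9 is a guess at how coverage_<level> is spelled, and the sample rows are made up for illustration, not taken from a real sweep.

```python
import csv
import io
from statistics import mean


def summarize_metrics(tsv_text, columns):
    """Return the mean of each requested numeric column across all fits."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return {col: mean(float(row[col]) for row in rows) for col in columns}


# Illustrative rows only; a real file would be read from
# <sweep_dir>/dvclive/metrics.tsv. The coverage column name is an assumption.
SAMPLE = (
    "config_seed\tcoverage_0.9\trmse\tbias\twall_seconds\n"
    "1\t0.5\t0.25\t0.5\t10\n"
    "2\t1.0\t0.75\t-0.5\t30\n"
)

summary = summarize_metrics(SAMPLE, ["coverage_0.9", "rmse", "bias", "wall_seconds"])
print(summary)
```

For a real sweep, pass the file contents instead of SAMPLE, e.g. Path(sweep_dir, "dvclive", "metrics.tsv").read_text().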


Canonical invocations

# Edit params.yaml first: sweep IDs (train_sweep.id, test_sweep.id), master_seed, n_configs.
dvc repro train_sweep         # ~5 GPU-hr at the default n_configs=50; ~10 GPU-hr at n_configs=100.
dvc repro test_sweep          # ~1 GPU-hr at n_configs=10.
dvc repro fit_corrections     # quick (~minutes); fits R(p) + scale-family from the train sweep.
dvc repro evaluate_layered    # quick; applies corrections to the test sweep and scores.

The sweep IDs in params.yaml are the canonical cross-sweep identifiers. They’re used both as DVC’s stable output paths (runs/<id>/) and as the manifest’s experiment_id cross-reference target (test sweeps link to their train sweep via pytyche.calibration.links.trained_correction_from).
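
The cross-reference can be followed with a few lines of Python. This is a minimal sketch, not part of the repository: the helper names (load_manifest, find_train_manifest) and the sweep directory names train-pilot and test-pilot are hypothetical. It scans every runs/*/manifest.json for a matching experiment_id, so it does not assume that the directory name and the experiment_id are the same string.

```python
import json
import tempfile
from pathlib import Path


def load_manifest(runs_dir, sweep_id):
    """Read runs/<sweep_id>/manifest.json."""
    return json.loads((Path(runs_dir) / sweep_id / "manifest.json").read_text())


def find_train_manifest(runs_dir, test_sweep_id):
    """Follow pytyche.calibration.links.trained_correction_from to the train manifest."""
    test = load_manifest(runs_dir, test_sweep_id)
    target = test["pytyche"]["calibration"]["links"]["trained_correction_from"]
    for path in sorted(Path(runs_dir).glob("*/manifest.json")):
        candidate = json.loads(path.read_text())
        if candidate.get("experiment_id") == target:
            return candidate
    raise LookupError(f"no manifest with experiment_id {target!r} under {runs_dir}")


# Demo against a throwaway directory; IDs and directory names are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp) / "runs"
    train_id = "2026-05-27T12-34-56Z_abc1234"
    manifests = {
        "train-pilot": {
            "experiment_id": train_id,
            "pytyche": {"calibration": {"sweep_kind": "train"}},
        },
        "test-pilot": {
            "experiment_id": "2026-05-28T08-00-00Z_abc1234",
            "pytyche": {
                "calibration": {
                    "sweep_kind": "test",
                    "links": {"trained_correction_from": train_id},
                }
            },
        },
    }
    for sweep, manifest in manifests.items():
        (runs / sweep).mkdir(parents=True)
        (runs / sweep / "manifest.json").write_text(json.dumps(manifest))
    linked = find_train_manifest(runs, "test-pilot")

print(linked["experiment_id"])
```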


Manifest schema (required top-level fields)

| Field | Type | Notes |
| --- | --- | --- |
| manifest_schema_version | integer | Currently 1. Monotonic; bumped on breaking schema changes. |
| experiment_id | string | {iso8601_utc}_{short_sha}, e.g. "2026-05-27T12-34-56Z_abc1234". |
| timestamp_utc | string | ISO 8601 UTC of experiment start. |
| git | object | {sha, dirty, branch}. |
| env | object | {python, platform} + optional library versions. |
| params | object | Free-form per-experiment hyperparameters. |
| data_provenance | object | Discriminated union: {kind: "synthetic", seed: int} or {kind: "external", hashes: {name: sha256}}. |
| pytyche | object | Reserved namespace for per-capability extension content. Currently has calibration. |
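
The required fields can be spot-checked without a schema library. The sketch below mirrors only the table above; docs/specs/experiment-manifest-schema.json remains the authority. The function name manifest_problems is hypothetical, and the experiment_id pattern is an assumption inferred from the single example ID (hyphens in place of colons, a lowercase hex short sha of at least 7 characters).

```python
import re

# Required top-level fields and their JSON types, mirrored from the table above.
REQUIRED_FIELDS = {
    "manifest_schema_version": int,
    "experiment_id": str,
    "timestamp_utc": str,
    "git": dict,
    "env": dict,
    "params": dict,
    "data_provenance": dict,
    "pytyche": dict,
}

# Assumed pattern for {iso8601_utc}_{short_sha}, inferred from the example ID.
EXPERIMENT_ID_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}Z_[0-9a-f]{7,40}$")


def manifest_problems(manifest):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in manifest:
            problems.append(f"missing field: {name}")
        elif type(manifest[name]) is not expected:
            # Exact type match, so that True is not accepted as an integer.
            problems.append(f"{name} must be {expected.__name__}")

    exp_id = manifest.get("experiment_id")
    if isinstance(exp_id, str) and not EXPERIMENT_ID_RE.match(exp_id):
        problems.append("experiment_id does not look like {iso8601_utc}_{short_sha}")

    # data_provenance is a discriminated union keyed on "kind".
    prov = manifest.get("data_provenance")
    if isinstance(prov, dict):
        kind = prov.get("kind")
        if kind == "synthetic":
            if type(prov.get("seed")) is not int:
                problems.append("data_provenance.seed must be an integer for kind 'synthetic'")
        elif kind == "external":
            if not isinstance(prov.get("hashes"), dict):
                problems.append("data_provenance.hashes must be an object for kind 'external'")
        else:
            problems.append(f"unknown data_provenance.kind: {kind!r}")
    return problems


# Illustrative manifest; the values are placeholders, not a real experiment.
example = {
    "manifest_schema_version": 1,
    "experiment_id": "2026-05-27T12-34-56Z_abc1234",
    "timestamp_utc": "2026-05-27T12:34:56Z",
    "git": {"sha": "abc1234", "dirty": False, "branch": "main"},
    "env": {"python": "3.12", "platform": "linux"},
    "params": {},
    "data_provenance": {"kind": "synthetic", "seed": 20260527},
    "pytyche": {"calibration": {}},
}

ok = manifest_problems(example)
broken = manifest_problems({**example, "data_provenance": {"kind": "external"}})
print(ok, broken)
```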

pytyche.calibration extension content

Set automatically by scripts/calibration_sweep_v2.py:

{
  "pytyche": {
    "calibration": {
      "master_seed": 20260527,
      "n_configs": 50,
      "scales": [250000],
      "generator_family": "v2",
      "sweep_kind": "train",         // or "test", "exploratory"
      "save_samples": false,         // true for test sweeps (--save-samples)
      "links": {                     // only present on test sweeps
        "trained_correction_from": "2026-05-27T12-34-56Z_abc1234"
      }
    }
  }
}
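
The inline comments in the example imply two constraints that are easy to check when reading a manifest: sweep_kind takes one of three values, and links appears only on test sweeps. The sketch below encodes just those two, under the assumption that the comments are exhaustive. The function name calibration_problems is hypothetical, and the sketch deliberately does not require links on a test sweep, since the source only says where it may appear.

```python
SWEEP_KINDS = {"train", "test", "exploratory"}


def calibration_problems(manifest):
    """Check the pytyche.calibration block against the constraints noted above."""
    calibration = manifest.get("pytyche", {}).get("calibration")
    if calibration is None:
        return ["pytyche.calibration is missing"]

    problems = []
    kind = calibration.get("sweep_kind")
    if kind not in SWEEP_KINDS:
        problems.append(f"unexpected sweep_kind: {kind!r}")

    # links.trained_correction_from is only expected on test sweeps.
    has_link = "trained_correction_from" in calibration.get("links", {})
    if has_link and kind != "test":
        problems.append("links.trained_correction_from is only expected on test sweeps")
    return problems


train_ok = calibration_problems(
    {"pytyche": {"calibration": {"sweep_kind": "train", "save_samples": False}}}
)
train_with_link = calibration_problems(
    {
        "pytyche": {
            "calibration": {
                "sweep_kind": "train",
                "links": {"trained_correction_from": "2026-05-27T12-34-56Z_abc1234"},
            }
        }
    }
)
print(train_ok, train_with_link)
```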

Why DVC + dvclive (not MLflow / wandb)

  • File-based discipline. Inspect with cat/grep/jq/pandas — no SaaS, no UI as source of truth.

  • Git-versioned config + content-addressed artifacts. Sweep IDs in params.yaml are the cross-stage identifiers; DVC caches large outputs and gives reproducibility on top.

  • Reserved per-capability namespace. Future capabilities (e.g., a Thompson-allocation tracking layer) extend the manifest under pytyche.<capability> rather than each area inventing its own manifest layer.

MLflow, wandb, and Aim were considered and rejected: each makes a service or a UI the source of truth, where this workflow wants plain files that survive grep and version control.