---
title: "Experiment Tracking"
review-state: drafting
last-human-review: "2026-06-03"
depends-on:
  - scripts/calibration_sweep_v2.py
  - dvc.yaml
  - params.yaml
owner: unowned
quadrant: concept
---

# Experiment Tracking

How pytyche's calibration SBC sweeps and their downstream artifacts are versioned, logged, and reproduced: what each piece is for, how the pieces fit, and the canonical invocations.

:::{note}
This page is for developers running calibration sweeps from a source checkout of the repository. The pipeline files it describes (`dvc.yaml`, `params.yaml`, `scripts/`) are not part of the installed package.
:::

---

## Pieces

| Artifact | Where | Owns |
|---|---|---|
| `dvc.yaml` | repo root | The pipeline DAG: `train_sweep → {fit_corrections, test_sweep} → evaluate_layered`. |
| `params.yaml` | repo root | Sweep IDs and per-stage CLI arguments. Edit before each pilot run. |
| `manifest.json` | each sweep dir | Per-sweep metadata: experiment_id, git sha and dirty flag, env (python/platform), params, data_provenance, and the `pytyche.calibration` extension content. Schema at `docs/specs/experiment-manifest-schema.json`. |
| `dvclive/` | each sweep dir | Per-fit metrics (`coverage_`, `rmse`, `bias`, `wall_seconds`) and params (`config_seed`, `n_visitors`, `generator_family`). Inspectable as TSV via `cat sweep_dir/dvclive/metrics.tsv`. |
| `summary.json` | per-config dir | Existing per-fit summary (preserved unchanged from pre-tracking sweeps; consumed by `scripts/fit_sbc_correction.py`). |
| `aggregate.csv` | each sweep dir | Existing per-sweep aggregate (preserved unchanged). |

---

## Canonical invocations

```bash
# Edit params.yaml first: sweep IDs (train_sweep.id, test_sweep.id), master_seed, n_configs.
dvc repro train_sweep       # ~5 GPU-hr at the default n_configs=50; ~10 GPU-hr at n_configs=100.
dvc repro test_sweep        # ~1 GPU-hr at n_configs=10.
dvc repro fit_corrections   # quick (~minutes); fits R(p) + scale-family from the train sweep.
dvc repro evaluate_layered  # quick; applies corrections to the test sweep and scores.
```

The sweep IDs in `params.yaml` are the **canonical cross-sweep identifiers**. They serve both as DVC's stable output paths (`runs//`) and as the manifest's `experiment_id` cross-reference target: test sweeps link to their train sweep via `pytyche.calibration.links.trained_correction_from`.

---

## Manifest schema (required top-level fields)

| Field | Type | Notes |
|---|---|---|
| `manifest_schema_version` | integer | Currently `1`. Monotonic; bumped on breaking schema changes. |
| `experiment_id` | string | `{iso8601_utc}_{short_sha}`, e.g. `"2026-05-27T12-34-56Z_abc1234"`. |
| `timestamp_utc` | string | ISO 8601 UTC of experiment start. |
| `git` | object | `{sha, dirty, branch}`. |
| `env` | object | `{python, platform}` plus optional library versions. |
| `params` | object | Free-form per-experiment hyperparameters. |
| `data_provenance` | object | Discriminated union: `{kind: "synthetic", seed: int}` or `{kind: "external", hashes: {name: sha256}}`. |
| `pytyche` | object | Reserved namespace for per-capability extension content. Currently has `calibration`. |

### `pytyche.calibration` extension content

Set automatically by `scripts/calibration_sweep_v2.py`:

```json
{
  "pytyche": {
    "calibration": {
      "master_seed": 20260527,
      "n_configs": 50,
      "scales": [250000],
      "generator_family": "v2",
      "sweep_kind": "train",   // or "test", "exploratory"
      "save_samples": false,   // true for test sweeps (--save-samples)
      "links": {               // only present on test sweeps
        "trained_correction_from": "2026-05-27T12-34-56Z_abc1234"
      }
    }
  }
}
```

---

## Why DVC + dvclive (not MLflow / wandb)

- **File-based discipline.** Inspect with `cat`/`grep`/`jq`/`pandas`: no SaaS, no UI as source of truth.
- **Git-versioned config + content-addressed artifacts.** Sweep IDs in `params.yaml` are the cross-stage identifiers; DVC caches large outputs and gives reproducibility on top.
- **Reserved per-capability namespace.** Future capabilities (e.g., a Thompson-allocation tracking layer) extend the manifest under `pytyche.` rather than reinventing a manifest layer per-area.

MLflow, wandb, and Aim were considered and rejected: each makes a service or a UI the source of truth, where this workflow wants plain files that survive `grep` and version control.
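
Because the manifest is a plain JSON file, the required top-level fields listed in the schema table can be spot-checked with the standard library alone. The following is a minimal sketch, not the project's validator: `check_manifest`, the `experiment_id` regex (inferred from the documented example), and every value in the sample manifest are assumptions made for illustration. The schema at `docs/specs/experiment-manifest-schema.json` remains the authority.

```python
import json
import re

# Required top-level fields and their JSON types, copied from the schema table.
REQUIRED_FIELDS = {
    "manifest_schema_version": int,
    "experiment_id": str,
    "timestamp_utc": str,
    "git": dict,
    "env": dict,
    "params": dict,
    "data_provenance": dict,
    "pytyche": dict,
}

# Assumption: pattern inferred from the documented example
# "2026-05-27T12-34-56Z_abc1234"; the real schema may be stricter or looser.
EXPERIMENT_ID_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}Z_[0-9a-f]+$")


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in manifest:
            problems.append(f"missing field: {name}")
            continue
        value = manifest[name]
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}")

    exp_id = manifest.get("experiment_id")
    if isinstance(exp_id, str) and not EXPERIMENT_ID_RE.match(exp_id):
        problems.append("experiment_id: does not look like {iso8601_utc}_{short_sha}")

    # data_provenance is a discriminated union keyed on "kind".
    prov = manifest.get("data_provenance")
    if isinstance(prov, dict):
        kind = prov.get("kind")
        if kind == "synthetic":
            if not _is_int(prov.get("seed")):
                problems.append("data_provenance.seed: expected int")
        elif kind == "external":
            if not isinstance(prov.get("hashes"), dict):
                problems.append("data_provenance.hashes: expected object")
        else:
            problems.append(f"data_provenance.kind: unexpected value {kind!r}")
    return problems


# Illustrative manifest; all values are made up for this example.
example = {
    "manifest_schema_version": 1,
    "experiment_id": "2026-05-27T12-34-56Z_abc1234",
    "timestamp_utc": "2026-05-27T12:34:56Z",
    "git": {"sha": "abc1234", "dirty": False, "branch": "main"},
    "env": {"python": "3.11", "platform": "linux"},
    "params": {},
    "data_provenance": {"kind": "synthetic", "seed": 20260527},
    "pytyche": {"calibration": {"sweep_kind": "train"}},
}

# Round-trip through JSON text to mimic reading manifest.json from a sweep dir.
problems = check_manifest(json.loads(json.dumps(example)))
print(problems)
```

To check a real sweep, the same function can be applied to the result of `json.load` on that sweep's `manifest.json`.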