pytyche.diagnostics.run_monitor

Resource monitoring primitives for GPU calibration sweeps.

Three components, each independent:

  • RunLog — Append-only JSONL event log (phase timing, events).

  • HWMonitor — Background nvidia-smi subprocess + /proc/stat sampler.

  • ResourceSummary — Aggregation of HWMonitor output into summary statistics.

Near-zero overhead on the Python/JAX process: GPU sampling runs in a separate OS process, and CPU sampling reads /proc/stat from a daemon thread.

Functions

build_resource_summary(gpu_csv_path, ...[, ...])

Parse HWMonitor output files into a ResourceSummary.

probe_jax_peak_memory_mb()

Query JAX for actual peak device memory usage.

Classes

HWMonitor()

Background hardware sampler via nvidia-smi subprocess + /proc/stat.

ResourceSummary([peak_vram_mb, ...])

Aggregated resource utilization from HWMonitor output files.

RunLog()

Accumulates timestamped JSONL entries for a single fit run.

class pytyche.diagnostics.run_monitor.RunLog[source]

Bases: object

Accumulates timestamped JSONL entries for a single fit run.

Each line is a JSON object with at least ts and type keys. The format is designed to be followed with tail -f on remote pods and filtered with jq.

log_phase(name, seconds, **extra)[source]

Record a completed phase with its wall-clock duration.

Parameters:
  • name (str)

  • seconds (float)

  • extra (Any)

Return type:

None

log_event(event_type, data=None)[source]

Record a point-in-time event (start, error, etc.).

Parameters:
  • event_type (str)

  • data (dict[str, Any] | None)

Return type:

None

write(path)[source]

Write all entries as newline-delimited JSON.

Parameters:

path (Path)

Return type:

None
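The logging pattern described above can be sketched as follows. This is a minimal stand-in, not the pytyche implementation: the class name SketchRunLog, the "phase" value used for the type key, and the nesting of event payloads under a data key are all assumptions made for illustration. Only the ts and type keys and the three method signatures come from the documentation.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, List, Optional


class SketchRunLog:
    """Hypothetical stand-in that mirrors the documented RunLog interface."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def log_phase(self, name: str, seconds: float, **extra: Any) -> None:
        # Assumption: completed phases are tagged with type "phase".
        entry = {"ts": time.time(), "type": "phase", "name": name, "seconds": seconds}
        entry.update(extra)
        self._entries.append(entry)

    def log_event(self, event_type: str, data: Optional[Dict[str, Any]] = None) -> None:
        # Assumption: the payload is nested so it cannot overwrite ts or type.
        entry: Dict[str, Any] = {"ts": time.time(), "type": event_type}
        if data is not None:
            entry["data"] = data
        self._entries.append(entry)

    def write(self, path: Path) -> None:
        # One JSON object per line (JSONL), in insertion order.
        with open(path, "w", encoding="utf-8") as fh:
            for entry in self._entries:
                fh.write(json.dumps(entry) + "\n")
```

With a log written this way, each line can be decoded independently with json.loads, which is what makes line-oriented tools such as jq convenient for filtering by type.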

class pytyche.diagnostics.run_monitor.HWMonitor[source]

Bases: object

Background hardware sampler via nvidia-smi subprocess + /proc/stat.

  • GPU: nvidia-smi writes CSV directly to disk (separate process).

  • GPU dmon: nvidia-smi dmon captures SM/memory-controller utilization.

  • CPU: daemon thread reads /proc/stat deltas (no external tool needed).

start(output_dir, interval_sec=2)[source]

Launch nvidia-smi, dmon, and CPU sampler.

Parameters:
  • output_dir (Path)

  • interval_sec (int)

Return type:

None
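To illustrate how a GPU sampler can run outside the Python process, the sketch below builds an nvidia-smi command line that loops at a fixed interval and writes CSV straight to a file. The function name, the gpu.csv file name, and the mapping from the documented CSV columns to nvidia-smi query fields are assumptions; the actual command that HWMonitor issues is not shown in this documentation.

```python
from pathlib import Path
from typing import List

# Assumed query fields, chosen to match the documented CSV column order.
QUERY_FIELDS = [
    "timestamp",
    "utilization.gpu",
    "utilization.memory",
    "memory.used",
    "memory.total",
    "power.draw",
    "temperature.gpu",
    "clocks.sm",
    "clocks.mem",
]


def build_gpu_sampler_cmd(output_dir: Path, interval_sec: int = 2) -> List[str]:
    """Return an argv list for a looping nvidia-smi CSV query (sketch only)."""
    csv_path = Path(output_dir) / "gpu.csv"  # hypothetical file name
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
        "-l", str(interval_sec),   # repeat the query every interval_sec seconds
        "-f", str(csv_path),       # nvidia-smi writes the file itself
    ]
```

Passing such a list to subprocess.Popen (after checking shutil.which("nvidia-smi")) would keep the sampling loop in a separate OS process, so the Python interpreter only pays for starting and stopping it.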

stop()[source]

Stop all monitors and return (gpu_csv_path, cpu_csv_path, dmon_csv_path).

A path is None when the corresponding monitor was not available.

Return type:

tuple[Path | None, Path | None, Path | None]
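The CPU side relies on deltas between successive /proc/stat readings. A minimal sketch of that calculation is shown below, working on the aggregate cpu line as text so it can be tried without a Linux host. The function names are hypothetical, and which counters are treated as "usr" and "idle" here (the user and idle fields only) is an assumption rather than the behaviour of the internal sampler.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int, int]:
    """Return (user_ticks, idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    ticks = [int(x) for x in fields[1:]]
    user = ticks[0]   # field 1: time spent in user mode
    idle = ticks[3]   # field 4: time spent idle
    # Sum the first eight counters; guest time is already included in user time.
    total = sum(ticks[:8])
    return user, idle, total


def cpu_pct_between(prev_line: str, curr_line: str) -> Tuple[float, float]:
    """Return (usr_pct, idle_pct) over the interval between two readings."""
    u0, i0, t0 = parse_cpu_line(prev_line)
    u1, i1, t1 = parse_cpu_line(curr_line)
    dt = t1 - t0
    if dt <= 0:
        return 0.0, 0.0  # no ticks elapsed; avoid dividing by zero
    return 100.0 * (u1 - u0) / dt, 100.0 * (i1 - i0) / dt
```

Because the counters are cumulative since boot, a single reading says little on its own; only the difference between two readings gives utilization over the sampling interval.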

class pytyche.diagnostics.run_monitor.ResourceSummary(peak_vram_mb=0.0, jax_peak_bytes_mb=0.0, gpu_util_mean_pct=0.0, gpu_util_max_pct=0.0, mem_util_mean_pct=0.0, mem_util_max_pct=0.0, power_watts_mean=0.0, power_watts_max=0.0, cpu_usr_mean_pct=0.0, cpu_idle_mean_pct=0.0)[source]

Bases: object

Aggregated resource utilization from HWMonitor output files.

Parameters:
  • peak_vram_mb (float)

  • jax_peak_bytes_mb (float)

  • gpu_util_mean_pct (float)

  • gpu_util_max_pct (float)

  • mem_util_mean_pct (float)

  • mem_util_max_pct (float)

  • power_watts_mean (float)

  • power_watts_max (float)

  • cpu_usr_mean_pct (float)

  • cpu_idle_mean_pct (float)

pytyche.diagnostics.run_monitor.probe_jax_peak_memory_mb()[source]

Query JAX for actual peak device memory usage.

Returns the peak memory in use, in MB, as reported by XLA’s allocator. This reflects the real working set rather than the pre-reserved pool that nvidia-smi reports. Returns 0.0 if JAX is not available or has no GPU devices.

Return type:

float
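One way such a probe could be written is sketched below, using JAX's per-device memory_stats() dictionary and its peak_bytes_in_use entry. The function name, the choice to take the maximum across devices rather than the sum, the set of accepted platform names, and the 1024 × 1024 bytes-per-MB conversion are assumptions; the library's own function may differ on each point.

```python
def sketch_peak_memory_mb() -> float:
    """Return peak allocator usage in MB, or 0.0 without JAX or GPU devices."""
    try:
        import jax  # imported lazily so the sketch also runs where JAX is absent

        devices = jax.local_devices()
    except Exception:
        return 0.0

    peak_bytes = 0
    for dev in devices:
        # Assumption: GPU devices report one of these platform names.
        if dev.platform not in ("gpu", "cuda", "rocm"):
            continue
        try:
            stats = dev.memory_stats() or {}  # may be None on some backends
        except Exception:
            stats = {}
        # Assumption: report the busiest single device, not the sum.
        peak_bytes = max(peak_bytes, int(stats.get("peak_bytes_in_use", 0)))

    return peak_bytes / (1024 * 1024)
```

The allocator figure is useful alongside the nvidia-smi reading because JAX by default preallocates a large share of device memory, so the driver-level number can stay high even when the working set is small.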

pytyche.diagnostics.run_monitor.build_resource_summary(gpu_csv_path, cpu_csv_path, dmon_csv_path=None)[source]

Parse HWMonitor output files into a ResourceSummary.

nvidia-smi CSV columns (no header, nounits):

timestamp, gpu_util%, mem_util%, mem_used_mb, mem_total_mb, power_w, temp_c, sm_clock_mhz, mem_clock_mhz

nvidia-smi dmon output (-s mu):

Header lines start with #. Data columns: gpu, sm, mem, fb (GPU index, SM utilization %, memory-controller utilization %, framebuffer memory in MB)

CPU CSV columns (written by _CpuSampler):

timestamp, usr_pct, idle_pct

Parameters:
  • gpu_csv_path (Path | None)

  • cpu_csv_path (Path | None)

  • dmon_csv_path (Path | None)

Return type:

ResourceSummary
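The GPU portion of this aggregation can be sketched as below, following the nvidia-smi column order listed above. The function name and the returned dictionary are illustrative only. Mapping peak_vram_mb to the maximum of mem_used_mb, and taking the mem_util statistics from this CSV rather than from the dmon output, are assumptions about how the real function fills ResourceSummary.

```python
import csv
import io
from statistics import mean
from typing import Dict, List


def summarize_gpu_csv(text: str) -> Dict[str, float]:
    """Aggregate headerless nvidia-smi CSV rows into mean/max statistics."""
    gpu_util: List[float] = []
    mem_util: List[float] = []
    mem_used: List[float] = []
    power: List[float] = []

    for row in csv.reader(io.StringIO(text)):
        if len(row) < 9:
            continue  # truncated line, e.g. the sampler was killed mid-write
        try:
            # Columns 1..5: gpu_util%, mem_util%, mem_used_mb, mem_total_mb, power_w
            g, m, used, _total, p = (float(cell) for cell in row[1:6])
        except ValueError:
            continue  # skip rows with non-numeric cells such as "[N/A]"
        gpu_util.append(g)
        mem_util.append(m)
        mem_used.append(used)
        power.append(p)

    if not gpu_util:
        return {}

    return {
        "peak_vram_mb": max(mem_used),
        "gpu_util_mean_pct": mean(gpu_util),
        "gpu_util_max_pct": max(gpu_util),
        "mem_util_mean_pct": mean(mem_util),
        "mem_util_max_pct": max(mem_util),
        "power_watts_mean": mean(power),
        "power_watts_max": max(power),
    }
```

Skipping malformed rows instead of raising keeps the summary usable when the sampler is stopped in the middle of writing a line.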