pytyche.diagnostics.run_monitor

Resource monitoring primitives for GPU calibration sweeps.

Three components, each independent:

RunLog — Append-only JSONL event log (phase timing, events).
HWMonitor — Background nvidia-smi subprocess + /proc/stat sampler.
ResourceSummary — Aggregation of HWMonitor output into summary statistics.

Sampling adds negligible overhead to the Python/JAX process: GPU sampling runs in a separate OS process, and CPU sampling reads /proc/stat in a daemon thread.

Functions

`build_resource_summary`(gpu_csv_path, ...[, ...])	Parse HWMonitor output files into a ResourceSummary.
`probe_jax_peak_memory_mb`()	Query JAX for actual peak device memory usage.

Classes

`HWMonitor`()	Background hardware sampler via nvidia-smi subprocess + /proc/stat.
`ResourceSummary`([peak_vram_mb, ...])	Aggregated resource utilization from HWMonitor output files.
`RunLog`()	Accumulates timestamped JSONL entries for a single fit run.

class pytyche.diagnostics.run_monitor.RunLog

Bases: object

Accumulates timestamped JSONL entries for a single fit run.

Each line is a JSON object with at least ts and type keys. Designed for tail -f on remote pods and jq filtering.

log_phase(name, seconds, **extra)

Record a completed phase with its wall-clock duration.

Parameters:

name (str)
seconds (float)
extra (Any)

Return type:

None

log_event(event_type, data=None)

Record a point-in-time event (start, error, etc.).

Parameters:

event_type (str)
data (dict[str, Any] | None)

Return type:

None

write(path)

Write all entries as newline-delimited JSON.

Parameters:

path (Path)

Return type:

None
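
The append-only JSONL pattern described for RunLog can be sketched with a small stand-in class. This is not the pytyche implementation: the class name `SketchRunLog`, the helper `read_entries`, and every field name other than ts and type (here "name", "seconds", and the "phase" type value) are assumptions made for illustration.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Optional


class SketchRunLog:
    """Hypothetical stand-in illustrating an append-only JSONL event log."""

    def __init__(self) -> None:
        self._entries: list[dict[str, Any]] = []

    def log_phase(self, name: str, seconds: float, **extra: Any) -> None:
        # "name" and "seconds" are assumed field names; "ts" and "type"
        # are the two keys the documentation guarantees on every line.
        self._entries.append(
            {**extra, "ts": time.time(), "type": "phase",
             "name": name, "seconds": seconds}
        )

    def log_event(self, event_type: str,
                  data: Optional[dict[str, Any]] = None) -> None:
        # Reserved keys are written last so user data cannot overwrite them.
        self._entries.append(
            {**(data or {}), "ts": time.time(), "type": event_type}
        )

    def write(self, path: Path) -> None:
        # One JSON object per line, which keeps the file usable with tail -f.
        with path.open("w", encoding="utf-8") as fh:
            for entry in self._entries:
                fh.write(json.dumps(entry) + "\n")


def read_entries(path: Path, entry_type: str) -> list[dict[str, Any]]:
    """Return entries of one type, similar to a jq select on .type."""
    with path.open(encoding="utf-8") as fh:
        return [e for e in map(json.loads, fh) if e.get("type") == entry_type]


log = SketchRunLog()
log.log_event("start", {"run_id": "demo"})
log.log_phase("warmup", 1.25, n_steps=10)

out_path = Path(tempfile.mkdtemp()) / "run_log.jsonl"
log.write(out_path)
phases = read_entries(out_path, "phase")
```

Keeping entries in memory and writing them in one call mirrors the documented write(path) method; a log meant to be followed live would instead append each line as it is recorded.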

class pytyche.diagnostics.run_monitor.HWMonitor

Bases: object

Background hardware sampler via nvidia-smi subprocess + /proc/stat.

GPU: nvidia-smi writes CSV directly to disk (separate process).
GPU dmon: nvidia-smi dmon captures SM/memory-controller utilization.
CPU: daemon thread reads /proc/stat deltas (no external tool needed).
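
The /proc/stat delta technique used for CPU sampling can be sketched as follows. The function names are hypothetical, and the choice to report the user counter alone (excluding nice) as the user percentage is an assumption; the actual sampler may define usr_pct differently.

```python
def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into jiffy counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest counters are already included in user/nice, so they are skipped.
    return [int(value) for value in fields[1:9]]


def cpu_percentages(prev: list[int], cur: list[int]) -> tuple[float, float]:
    """Return (usr_pct, idle_pct) for the interval between two samples."""
    deltas = [c - p for p, c in zip(prev, cur)]
    total = sum(deltas)
    if total <= 0:
        return 0.0, 0.0  # no time elapsed between the two samples
    usr_pct = 100.0 * deltas[0] / total
    idle_pct = 100.0 * deltas[3] / total
    return usr_pct, idle_pct


# Two fixed samples stand in for reads of /proc/stat taken one interval apart.
prev_sample = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
cur_sample = parse_cpu_line("cpu  150 0 70 870 60 0 0 0 0 0")
usr_pct, idle_pct = cpu_percentages(prev_sample, cur_sample)
```

The counters in /proc/stat are cumulative since boot, so only the difference between two reads describes utilization over the sampling interval.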

start(output_dir, interval_sec=2)

Launch nvidia-smi, dmon, and CPU sampler.

Parameters:

output_dir (Path)
interval_sec (int)

Return type:

None
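
As an illustration of the out-of-process GPU sampling that start() sets up, the sketch below builds an nvidia-smi command that writes headerless CSV to a file at a fixed interval. The helper names and the exact query-field list are assumptions; the fields were chosen to line up with the CSV columns documented for build_resource_summary and may not match what HWMonitor actually passes.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional

# Assumed query fields, one per documented CSV column.
QUERY_FIELDS = [
    "timestamp",
    "utilization.gpu",
    "utilization.memory",
    "memory.used",
    "memory.total",
    "power.draw",
    "temperature.gpu",
    "clocks.sm",
    "clocks.mem",
]


def build_nvidia_smi_cmd(csv_path: Path, interval_sec: int = 2) -> list[str]:
    """Build a command that appends one CSV row every interval_sec seconds."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
        "-l", str(interval_sec),   # loop interval in seconds
        "-f", str(csv_path),       # write output to a file instead of stdout
    ]


def launch_gpu_sampler(csv_path: Path,
                       interval_sec: int = 2) -> Optional[subprocess.Popen]:
    """Start the sampler, or return None when nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None
    return subprocess.Popen(
        build_nvidia_smi_cmd(csv_path, interval_sec),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


cmd = build_nvidia_smi_cmd(Path("gpu.csv"), interval_sec=2)
```

A process started this way runs until it is terminated, for example with the returned handle's terminate() method followed by wait().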

stop()

Stop all monitors and return (gpu_csv_path, cpu_csv_path, dmon_csv_path).

A path is None when the corresponding monitor was not available.

Return type:

tuple[Path | None, Path | None, Path | None]

class pytyche.diagnostics.run_monitor.ResourceSummary(peak_vram_mb=0.0, jax_peak_bytes_mb=0.0, gpu_util_mean_pct=0.0, gpu_util_max_pct=0.0, mem_util_mean_pct=0.0, mem_util_max_pct=0.0, power_watts_mean=0.0, power_watts_max=0.0, cpu_usr_mean_pct=0.0, cpu_idle_mean_pct=0.0)

Bases: object

Aggregated resource utilization from HWMonitor output files.

Parameters:

peak_vram_mb (float)
jax_peak_bytes_mb (float)
gpu_util_mean_pct (float)
gpu_util_max_pct (float)
mem_util_mean_pct (float)
mem_util_max_pct (float)
power_watts_mean (float)
power_watts_max (float)
cpu_usr_mean_pct (float)
cpu_idle_mean_pct (float)

pytyche.diagnostics.run_monitor.probe_jax_peak_memory_mb()

Query JAX for actual peak device memory usage.

Returns the peak memory in use, in MB, as reported by XLA’s allocator. This reflects the real working set rather than the pre-reserved pool that nvidia-smi reports. Returns 0.0 if JAX is not available or has no GPU devices.

Return type:

float
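
One way such a probe could work is through the memory_stats() method on JAX devices, sketched below. The function name `peak_device_memory_mb`, the choice to take the maximum across devices, and the bytes-to-MB conversion factor of 1024 * 1024 are all assumptions; the library's own function may differ on each point.

```python
def peak_device_memory_mb() -> float:
    """Return the largest per-GPU peak allocation in MB, or 0.0 if unavailable."""
    try:
        import jax  # imported lazily so the probe degrades without JAX

        devices = jax.devices()
    except Exception:
        # JAX is missing or no backend could be initialized.
        return 0.0

    peak_bytes = 0
    for device in devices:
        if device.platform != "gpu":
            continue
        stats = device.memory_stats()  # may be None on some backends
        if stats:
            peak_bytes = max(peak_bytes, stats.get("peak_bytes_in_use", 0))
    return peak_bytes / (1024 * 1024)


peak_mb = peak_device_memory_mb()
```

By default JAX preallocates a large share of GPU memory, which is why the allocator's own peak counter is more informative here than the memory figure nvidia-smi shows for the process.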

pytyche.diagnostics.run_monitor.build_resource_summary(gpu_csv_path, cpu_csv_path, dmon_csv_path=None)

Parse HWMonitor output files into a ResourceSummary.

nvidia-smi CSV columns (no header, nounits):

timestamp, gpu_util%, mem_util%, mem_used_mb, mem_total_mb, power_w, temp_c, sm_clock_mhz, mem_clock_mhz

nvidia-smi dmon output (-s mu):

Header lines start with #. Data columns: gpu sm mem fb (SM%, memory-controller%, framebuffer MB)

CPU CSV columns (written by _CpuSampler):

timestamp, usr_pct, idle_pct

Parameters:

gpu_csv_path (Path | None)
cpu_csv_path (Path | None)
dmon_csv_path (Path | None)

Return type:

ResourceSummary
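
The aggregation over the nvidia-smi CSV can be sketched from the column order listed above. The function `summarize_gpu_csv` is hypothetical, the sample rows are invented for illustration, and the mapping from columns to summary fields (for example, peak_vram_mb as the maximum of mem_used_mb) is an assumption about how build_resource_summary behaves.

```python
import csv
import io
from statistics import mean

# Column order as documented for the headerless nvidia-smi CSV.
GPU_COLUMNS = [
    "timestamp", "gpu_util", "mem_util", "mem_used_mb", "mem_total_mb",
    "power_w", "temp_c", "sm_clock_mhz", "mem_clock_mhz",
]


def summarize_gpu_csv(text: str) -> dict[str, float]:
    """Aggregate headerless nvidia-smi CSV rows into mean/max statistics."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(raw) != len(GPU_COLUMNS):
            continue  # skip blank or truncated lines
        try:
            # The timestamp column is not numeric, so it is left out.
            rows.append({k: float(v) for k, v in zip(GPU_COLUMNS[1:], raw[1:])})
        except ValueError:
            continue  # skip rows with non-numeric placeholders
    if not rows:
        return {}
    return {
        "peak_vram_mb": max(r["mem_used_mb"] for r in rows),
        "gpu_util_mean_pct": mean(r["gpu_util"] for r in rows),
        "gpu_util_max_pct": max(r["gpu_util"] for r in rows),
        "mem_util_mean_pct": mean(r["mem_util"] for r in rows),
        "mem_util_max_pct": max(r["mem_util"] for r in rows),
        "power_watts_mean": mean(r["power_w"] for r in rows),
        "power_watts_max": max(r["power_w"] for r in rows),
    }


# Invented sample rows in the documented column order.
sample = (
    "2024/01/01 00:00:00.000, 50, 20, 1000, 16000, 100.0, 40, 1500, 5000\n"
    "2024/01/01 00:00:02.000, 90, 40, 3000, 16000, 200.0, 45, 1500, 5000\n"
)
summary = summarize_gpu_csv(sample)
```

Skipping malformed rows rather than raising keeps the summary usable when the sampler is killed partway through writing a line.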