pytyche.diagnostics.run_monitor
Resource monitoring primitives for GPU calibration sweeps.
Three components, each independent:
- RunLog: append-only JSONL event log (phase timing, events).
- HWMonitor: background nvidia-smi subprocess + /proc/stat sampler.
- ResourceSummary: aggregation of HWMonitor output into summary statistics.
Designed to add negligible overhead to the Python/JAX process: GPU sampling runs in a separate OS process, and CPU sampling reads /proc/stat from a daemon thread.
Functions
build_resource_summary | Parse HWMonitor output files into a ResourceSummary. |
probe_jax_peak_memory_mb | Query JAX for actual peak device memory usage. |
Classes
HWMonitor | Background hardware sampler via nvidia-smi subprocess + /proc/stat. |
ResourceSummary | Aggregated resource utilization from HWMonitor output files. |
RunLog | Accumulates timestamped JSONL entries for a single fit run. |
- class pytyche.diagnostics.run_monitor.RunLog
Bases: object
Accumulates timestamped JSONL entries for a single fit run.
Each line is a JSON object with at least ts and type keys. Designed for tail -f on remote pods and jq filtering.
- log_phase(name, seconds, **extra)
Record a completed phase with its wall-clock duration.
- Parameters:
name (str)
seconds (float)
extra (Any)
- Return type:
None
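The append-only JSONL pattern described for RunLog can be illustrated with a small stand-in. This is only a sketch, not pytyche's implementation: the class name SketchRunLog, the constructor taking a file path, the helper read_entries, and the "phase" value for the type key are all assumptions made for illustration. Only the ts and type keys and the log_phase(name, seconds, **extra) signature come from the documentation above.

```python
import json
import time
from pathlib import Path
from typing import Any


class SketchRunLog:
    """Hypothetical stand-in for RunLog (assumed constructor and layout)."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def _append(self, entry: dict[str, Any]) -> None:
        # Open in append mode and write one JSON object per line, so the
        # file can be followed with `tail -f` while the run is in progress.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    def log_phase(self, name: str, seconds: float, **extra: Any) -> None:
        # Extra fields go first so they cannot overwrite "ts" or "type",
        # the two keys every line is documented to carry.
        entry = {
            **extra,
            "ts": time.time(),
            "type": "phase",  # assumed value; the real one is not documented
            "name": name,
            "seconds": float(seconds),
        }
        self._append(entry)


def read_entries(path: Path) -> list[dict[str, Any]]:
    """Parse a JSONL file back into a list of dicts (illustrative helper)."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With a layout like this, a filter such as jq 'select(.type == "phase")' applied to the log file would keep only the phase-timing lines.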
- class pytyche.diagnostics.run_monitor.HWMonitor
Bases: object
Background hardware sampler via nvidia-smi subprocess + /proc/stat.
GPU: nvidia-smi writes CSV directly to disk (separate process).
GPU dmon: nvidia-smi dmon captures SM/memory-controller utilization.
CPU: daemon thread reads /proc/stat deltas (no external tool needed).
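The two sampling paths can be sketched roughly as follows. None of this is taken from HWMonitor itself: the function names, the chosen --query-gpu fields (picked to match the CSV column order documented for build_resource_summary), and the decision to sum the user-through-steal counters as the total are assumptions. The first function only builds an nvidia-smi argument list; the other two show how user and idle percentages fall out of two consecutive aggregate "cpu" lines from /proc/stat.

```python
from pathlib import Path


def nvidia_smi_command(out_path: Path, interval_s: int = 1) -> list[str]:
    """Build (but do not run) an nvidia-smi call that logs CSV to a file."""
    fields = (
        "timestamp,utilization.gpu,utilization.memory,memory.used,"
        "memory.total,power.draw,temperature.gpu,clocks.sm,clocks.mem"
    )
    return [
        "nvidia-smi",
        f"--query-gpu={fields}",
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),   # repeat the query every interval_s seconds
        "-f", str(out_path),     # nvidia-smi writes the file itself
    ]


def parse_cpu_line(line: str) -> tuple[int, int, int]:
    """Return (user, idle, total) jiffies from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in fields[1:]]
    user, idle = values[0], values[3]
    # Sum user..steal only; guest time is already included in user.
    total = sum(values[:8])
    return user, idle, total


def cpu_percentages(prev_line: str, curr_line: str) -> tuple[float, float]:
    """Return (usr_pct, idle_pct) over the interval between two samples."""
    u0, i0, t0 = parse_cpu_line(prev_line)
    u1, i1, t1 = parse_cpu_line(curr_line)
    dt = t1 - t0
    if dt <= 0:
        return 0.0, 0.0  # no ticks elapsed; avoid dividing by zero
    return 100.0 * (u1 - u0) / dt, 100.0 * (i1 - i0) / dt
```

A list like the one returned by nvidia_smi_command could be handed to subprocess.Popen so that GPU sampling lives in its own OS process, while a daemon thread would call cpu_percentages on successive reads of the first line of /proc/stat.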
- class pytyche.diagnostics.run_monitor.ResourceSummary(peak_vram_mb=0.0, jax_peak_bytes_mb=0.0, gpu_util_mean_pct=0.0, gpu_util_max_pct=0.0, mem_util_mean_pct=0.0, mem_util_max_pct=0.0, power_watts_mean=0.0, power_watts_max=0.0, cpu_usr_mean_pct=0.0, cpu_idle_mean_pct=0.0)
Bases: object
Aggregated resource utilization from HWMonitor output files.
- Parameters:
peak_vram_mb (float)
jax_peak_bytes_mb (float)
gpu_util_mean_pct (float)
gpu_util_max_pct (float)
mem_util_mean_pct (float)
mem_util_max_pct (float)
power_watts_mean (float)
power_watts_max (float)
cpu_usr_mean_pct (float)
cpu_idle_mean_pct (float)
- pytyche.diagnostics.run_monitor.probe_jax_peak_memory_mb()
Query JAX for actual peak device memory usage.
Returns the peak bytes in use, converted to MB, as reported by XLA's allocator. This reflects the real working set rather than the pre-reserved pool that nvidia-smi reports. Returns 0.0 if JAX is not available or has no GPU devices.
- Return type:
float
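One way such a probe could be written is sketched below. It is not the library's code: the function name is hypothetical, and taking the maximum across local devices, using 1024 * 1024 bytes per MB, and accepting "gpu", "cuda", or "rocm" as platform names are all assumptions. The sketch relies on Device.memory_stats(), which may return None on backends that do not expose allocator statistics.

```python
def sketch_probe_peak_memory_mb() -> float:
    """Hypothetical sketch: peak allocator usage in MB, or 0.0 if unavailable."""
    try:
        import jax  # imported lazily so the probe degrades without JAX
    except ImportError:
        return 0.0

    try:
        devices = jax.local_devices()
    except RuntimeError:
        return 0.0  # backend failed to initialize

    peak_bytes = 0
    for device in devices:
        if device.platform not in ("gpu", "cuda", "rocm"):
            continue  # skip CPU and other non-GPU devices
        stats = device.memory_stats()  # None if the backend has no stats
        if stats:
            # Assumption: report the largest per-device peak, not the sum.
            peak_bytes = max(peak_bytes, stats.get("peak_bytes_in_use", 0))

    return peak_bytes / (1024 * 1024)
```

On a machine without JAX or without GPU devices the sketch returns 0.0, matching the fallback behavior documented above.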
- pytyche.diagnostics.run_monitor.build_resource_summary(gpu_csv_path, cpu_csv_path, dmon_csv_path=None)
Parse HWMonitor output files into a ResourceSummary.
- nvidia-smi CSV columns (no header, nounits):
timestamp, gpu_util%, mem_util%, mem_used_mb, mem_total_mb, power_w, temp_c, sm_clock_mhz, mem_clock_mhz
- nvidia-smi dmon output (-s mu):
Header lines start with #. Data columns: gpu sm mem fb (SM%, memory-controller%, framebuffer MB)
- CPU CSV columns (written by _CpuSampler):
timestamp, usr_pct, idle_pct
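To make the GPU CSV layout concrete, the sketch below aggregates text in that column order into a few of the ResourceSummary field names. It is an illustration under assumptions, not the actual parser: the function name is hypothetical, it returns a plain dict instead of a ResourceSummary, it treats the maximum of mem_used_mb as peak_vram_mb, and it skips rows whose values are not numeric. How the real function combines the dmon and CPU files is not shown here.

```python
import csv
import io
from statistics import mean


def summarize_gpu_csv(text: str) -> dict[str, float]:
    """Hypothetical sketch: aggregate headerless nvidia-smi CSV text."""
    gpu_util, mem_util, mem_used, power = [], [], [], []
    # skipinitialspace handles the ", " separator that nvidia-smi emits.
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) < 6:
            continue  # blank or truncated line
        try:
            g, m, used, p = (float(row[i]) for i in (1, 2, 3, 5))
        except ValueError:
            continue  # non-numeric cell, e.g. an unsupported metric
        gpu_util.append(g)
        mem_util.append(m)
        mem_used.append(used)
        power.append(p)

    if not gpu_util:
        # Mirror the all-zero defaults of ResourceSummary when nothing parsed.
        return dict.fromkeys(
            ("peak_vram_mb", "gpu_util_mean_pct", "gpu_util_max_pct",
             "mem_util_mean_pct", "mem_util_max_pct",
             "power_watts_mean", "power_watts_max"), 0.0)

    return {
        "peak_vram_mb": max(mem_used),  # assumption: peak = max mem_used_mb
        "gpu_util_mean_pct": mean(gpu_util),
        "gpu_util_max_pct": max(gpu_util),
        "mem_util_mean_pct": mean(mem_util),
        "mem_util_max_pct": max(mem_util),
        "power_watts_mean": mean(power),
        "power_watts_max": max(power),
    }
```

Returning zeros for empty input keeps the sketch in line with the 0.0 defaults in the ResourceSummary signature.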
- Parameters:
gpu_csv_path (Path | None)
cpu_csv_path (Path | None)
dmon_csv_path (Path | None)
- Return type: