pytyche.calibrate.scorecard

v2 scorecard — decision summary and per-scenario aggregation.

Functions

compute_scorecard(records)

Group CalibrationRecords by scenario_id and compute per-group metrics.

Classes

CellRegretStats(mean, median)

Per-cell regret statistics for a single oracle × actual decision pair.

DecisionSummary(n_correct, n_false_ship, ...)

Summary of oracle-vs-actual decision accuracy.

ScenarioScorecard(scenario_id, ...)

Per-scenario aggregated calibration metrics.

class pytyche.calibrate.scorecard.CellRegretStats(mean, median)[source]

Bases: object

Per-cell regret statistics for a single oracle × actual decision pair.

mean

Mean regret across all records in this cell; None if no records.

median

Median regret across all records in this cell; None if no records.

Parameters:
  • mean (float | None)

  • median (float | None)
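The "None if no records" rule can be illustrated with a minimal sketch. `CellRegretStatsSketch` and `summarize_cell` are hypothetical stand-ins written for this example, not part of pytyche; only the field names and the empty-cell behaviour come from the description above.

```python
import statistics
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CellRegretStatsSketch:
    """Hypothetical stand-in mirroring the documented mean/median fields."""
    mean: Optional[float]
    median: Optional[float]


def summarize_cell(regrets: Sequence[float]) -> CellRegretStatsSketch:
    """Summarize the regret values of one oracle x actual cell (illustrative helper)."""
    # An empty cell has no defined statistics, so both fields are None.
    if not regrets:
        return CellRegretStatsSketch(mean=None, median=None)
    return CellRegretStatsSketch(
        mean=statistics.mean(regrets),
        median=statistics.median(regrets),
    )


empty = summarize_cell([])
filled = summarize_cell([0.0, 1.0, 5.0])
```

Here `empty` carries None in both fields, while `filled` has a mean of 2.0 and a median of 1.0.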

class pytyche.calibrate.scorecard.DecisionSummary(n_correct, n_false_ship, n_missed_win, decision_matrix, cell_regret)[source]

Bases: object

Summary of oracle-vs-actual decision accuracy.

Convenience fields (n_correct, n_false_ship, n_missed_win) are derived views of decision_matrix for the current 2-arm phase. When multi-arm support lands, the matrix naturally extends without API breakage.

n_correct

Number of decisions that matched the oracle.

n_false_ship

Number of decisions that shipped when the oracle did not (oracle != SHIP, actual == SHIP).

n_missed_win

Number of decisions that did not ship when the oracle did (oracle == SHIP, actual != SHIP).

decision_matrix

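{oracle_decision_value: {actual_decision_value: count}}. Keys are Decision enum values (strings).

cell_regret

{oracle_decision_value: {actual_decision_value: CellRegretStats}}. Regret statistics for each oracle × actual cell, with the same nested keys as decision_matrix.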

Parameters:
  • n_correct (int)

  • n_false_ship (int)

  • n_missed_win (int)

  • decision_matrix (dict[str, dict[str, int]])

  • cell_regret (dict[str, dict[str, CellRegretStats]])
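The following sketch shows how the three convenience counts could be read off a decision matrix of the documented shape. It is illustrative only: `derive_counts` is a hypothetical helper, and the string values "ship" and "no_ship" are assumed stand-ins for the real Decision enum values, which this page does not list.

```python
from typing import Dict

SHIP = "ship"  # assumed enum value; the real Decision values may differ


def derive_counts(matrix: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Derive the convenience counts from {oracle: {actual: count}} (illustrative)."""
    n_correct = n_false_ship = n_missed_win = 0
    for oracle, row in matrix.items():
        for actual, count in row.items():
            if oracle == actual:
                # The actual decision matched the oracle.
                n_correct += count
            if oracle != SHIP and actual == SHIP:
                # Shipped although the oracle did not.
                n_false_ship += count
            if oracle == SHIP and actual != SHIP:
                # Did not ship although the oracle did.
                n_missed_win += count
    return {
        "n_correct": n_correct,
        "n_false_ship": n_false_ship,
        "n_missed_win": n_missed_win,
    }


matrix = {
    "ship": {"ship": 6, "no_ship": 2},
    "no_ship": {"ship": 1, "no_ship": 3},
}
counts = derive_counts(matrix)
```

With this matrix, the diagonal gives 9 correct decisions, the one "no_ship" oracle row that shipped gives 1 false ship, and the "ship" oracle row that did not ship gives 2 missed wins. Because the loop never assumes exactly two decision values, the same logic would carry over to a larger matrix.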

class pytyche.calibrate.scorecard.ScenarioScorecard(scenario_id, n_records_total, n_records_used, decision_summary, coverage_rate, bias, rmse, false_ship_rate, missed_win_rate, mean_regret, mean_regret_cpm)[source]

Bases: object

Per-scenario aggregated calibration metrics.

All 7 metric fields are float | None. When n_records_used == 0, all metrics are None.

scenario_id

Scenario identifier (from CalibrationRecord).

n_records_total

Count of ALL records with this scenario_id (pre-filter).

n_records_used

Count of HONEST_ESTIMATE records (post-filter).

decision_summary

Decision accuracy summary (counts only, no rates).

coverage_rate

Fraction of confidence intervals (CIs) that contain the true effect, in [0, 1].

bias

mean(estimated_lift - effect) in metric-native units.

rmse

sqrt(mean((estimated_lift - effect)^2)) in metric-native units.

false_ship_rate

n_false_ship / n_records_used. The denominator is all used records (total-denominator), not only those where the oracle did not ship.

missed_win_rate

n_missed_win / n_records_used. The denominator is all used records (total-denominator), not only those where the oracle shipped.

mean_regret

Mean of non-None regret values; None if ALL are None.

mean_regret_cpm

mean_regret * 1000 if mean_regret is not None, else None.

Parameters:
  • scenario_id (str)

  • n_records_total (int)

  • n_records_used (int)

  • decision_summary (DecisionSummary)

  • coverage_rate (float | None)

  • bias (float | None)

  • rmse (float | None)

  • false_ship_rate (float | None)

  • missed_win_rate (float | None)

  • mean_regret (float | None)

  • mean_regret_cpm (float | None)
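The metric definitions above can be worked through on a small example. This is a sketch under stated assumptions rather than the library's implementation: `scenario_metrics` and the row tuple layout are hypothetical, and treating the CI bounds as inclusive is an assumption, since this page does not say how boundary cases are handled.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical row layout: (estimated_lift, effect, ci_low, ci_high, regret)
Row = Tuple[float, float, float, float, Optional[float]]


def scenario_metrics(
    rows: List[Row], n_false_ship: int, n_missed_win: int
) -> Dict[str, Optional[float]]:
    """Compute the seven documented metrics from already-filtered rows (illustrative)."""
    names = ["coverage_rate", "bias", "rmse", "false_ship_rate",
             "missed_win_rate", "mean_regret", "mean_regret_cpm"]
    n = len(rows)
    if n == 0:
        # No used records: every metric is None.
        return {name: None for name in names}

    errors = [est - eff for est, eff, _, _, _ in rows]
    # Assumption: a CI "contains" the effect when low <= effect <= high.
    covered = sum(1 for _, eff, low, high, _ in rows if low <= eff <= high)
    regrets = [row[4] for row in rows if row[4] is not None]
    mean_regret = sum(regrets) / len(regrets) if regrets else None

    return {
        "coverage_rate": covered / n,
        "bias": sum(errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "false_ship_rate": n_false_ship / n,
        "missed_win_rate": n_missed_win / n,
        "mean_regret": mean_regret,
        "mean_regret_cpm": mean_regret * 1000 if mean_regret is not None else None,
    }


rows = [
    (1.0, 0.0, 0.5, 1.5, 0.5),    # error +1, CI misses the effect
    (0.0, 1.0, -1.0, 1.5, None),  # error -1, CI covers, regret unavailable
    (2.0, 1.0, 0.5, 3.5, 0.0),    # error +1, CI covers
    (1.0, 0.0, -0.5, 2.5, 0.25),  # error +1, CI covers
]
result = scenario_metrics(rows, n_false_ship=1, n_missed_win=0)
empty_result = scenario_metrics([], n_false_ship=0, n_missed_win=0)
```

Note that mean_regret averages over the three non-None regret values only, while the two rates divide by all four used rows.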

pytyche.calibrate.scorecard.compute_scorecard(records)[source]

Group CalibrationRecords by scenario_id and compute per-group metrics.

Filters records to analysis_mode == ClaimLevel.HONEST_ESTIMATE before computing metrics. Both n_records_total (pre-filter) and n_records_used (post-filter) are surfaced on each ScenarioScorecard.

Parameters:

records (list[CalibrationRecord]) – Flat list of CalibrationRecords from one or more scenarios.

Return type:

list[ScenarioScorecard]

Returns:

List of ScenarioScorecards, one per unique scenario_id, sorted by scenario_id for consistent ordering.
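The group, filter, and sort behaviour described above can be sketched without the library. Everything here is a hypothetical stand-in: `RecordSketch` models only the two fields the description mentions, the string "honest_estimate" stands in for ClaimLevel.HONEST_ESTIMATE, and `count_by_scenario` returns plain tuples rather than ScenarioScorecard objects.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

HONEST_ESTIMATE = "honest_estimate"  # assumed stand-in for ClaimLevel.HONEST_ESTIMATE


@dataclass(frozen=True)
class RecordSketch:
    """Hypothetical record with only the fields needed for grouping."""
    scenario_id: str
    analysis_mode: str


def count_by_scenario(records: List[RecordSketch]) -> List[Tuple[str, int, int]]:
    """Return (scenario_id, n_records_total, n_records_used), sorted by scenario_id."""
    totals: Dict[str, int] = defaultdict(int)
    used: Dict[str, int] = defaultdict(int)
    for record in records:
        # Every record counts toward the pre-filter total.
        totals[record.scenario_id] += 1
        # Only honest-estimate records count toward the post-filter total.
        if record.analysis_mode == HONEST_ESTIMATE:
            used[record.scenario_id] += 1
    # Iterate over totals so a scenario with zero used records still appears.
    return [(sid, totals[sid], used[sid]) for sid in sorted(totals)]


records = [
    RecordSketch("b", HONEST_ESTIMATE),
    RecordSketch("a", HONEST_ESTIMATE),
    RecordSketch("a", "other_mode"),
    RecordSketch("c", "other_mode"),
]
summary = count_by_scenario(records)
```

Scenario "c" is kept with a used count of 0, matching the documented case where n_records_used == 0 and all metrics are None.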