pytyche.generators.core

V2 generator core — single potential-outcomes pipeline.

Public API

generate_v2_core(config: V2GeneratorConfig) -> CalibrationBundle

Pipeline stages:

sample_features(config, rng, n) -> X: pd.DataFrame
compute_potential_outcomes(features, metric_mode, revenue_model=None) -> TruthResult
assign_and_observe(features, truth_result, assignment, metric_id, rng, ...) -> list[VariantData]
build_bundle(variants, truth_result, metric_mode, experiment_id) -> CalibrationBundle

Config design

Typed frozen dataclasses with __post_init__ fail-closed validation. Feature sampler is a discriminated union: CopulaConfig | MixtureConfig. Both strategies produce the same output shape (X: pd.DataFrame). Surfaces are Callable[[pd.DataFrame], np.ndarray].
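The pattern described above can be sketched with the standard library alone. The class names `CopulaSketch` and `MixtureSketch` and the function `describe` are hypothetical stand-ins invented for this illustration; the real config classes carry more fields and more checks.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical stand-ins, not the library's classes.
@dataclass(frozen=True)
class CopulaSketch:
    n_features: int

    def __post_init__(self) -> None:
        # Fail closed: reject bad input at construction time.
        if self.n_features <= 0:
            raise ValueError("n_features must be positive")


@dataclass(frozen=True)
class MixtureSketch:
    weights: tuple

    def __post_init__(self) -> None:
        if not self.weights or any(w <= 0 for w in self.weights):
            raise ValueError("weights must be non-empty and strictly positive")


# The union type that a sampler-consuming function accepts.
SamplerSketch = Union[CopulaSketch, MixtureSketch]


def describe(config: SamplerSketch) -> str:
    # Dispatch on the concrete member of the union.
    if isinstance(config, CopulaSketch):
        return "copula"
    if isinstance(config, MixtureSketch):
        return "mixture"
    raise TypeError(f"unsupported sampler config: {type(config).__name__}")
```

Because validation runs in `__post_init__`, an invalid config can never exist as an object, so downstream stages do not need to re-check it.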

Module Attributes

FeatureSamplerConfig

Discriminated union of feature sampler strategy configs.

Functions

`add_surfaces`(s1, s2)	Return a surface whose output is the element-wise sum of two surfaces.
`assign_and_observe`(features, truth_result, ...)	Assign visitors to treatment levels and sample observed outcomes.
`build_bundle`(variants, truth_result, ...)	Assemble a CalibrationBundle from variant list and truth.
`build_truth`(truth_result, metric_mode)	Construct a CalibrationTruth from computed potential outcomes.
`compute_potential_outcomes`(features, metric_mode)	Compute per-visitor potential outcomes and derive truth arrays.
`generate_v2_core`(config)	Generate a CalibrationBundle via the v2 potential-outcomes pipeline.
`multiply_surfaces`(s1, s2)	Return a surface whose output is the element-wise product of two surfaces.
`sample_features`(config, rng, n)	Sample a feature matrix from the given sampler config.
`sigmoid_surface`(s)	Return a surface applying the sigmoid transform to an inner surface.
`threshold_surface`(s, cutoff, above, below)	Return a step-function surface based on a threshold applied to an inner surface.

Classes

`AssignmentConfig`([treatment_allocation, ...])	Fixed (forced) treatment-assignment policy.
`CopulaConfig`(features, correlation)	Gaussian copula feature sampler configuration.
`FeatureSpec`(name, kind)	Specification for a single feature column.
`MetricMode`(metric_id[, p0_surface, ...])	Metric-specific potential outcome surfaces.
`MixtureConfig`(features, weights, cluster_params)	Mixture-of-populations feature sampler configuration.
`SurfaceConfig`(fn)	HTE surface definition as a callable.
`TruthResult`(cate_per_visitor, effect, ...[, ...])	Internal intermediate result from compute_potential_outcomes.
`V2GeneratorConfig`(n_visitors, ...[, ...])	Top-level configuration for the v2 core generator pipeline.

class pytyche.generators.core.FeatureSpec(name, kind)

Bases: object

Specification for a single feature column.

name: Column name. Must be non-empty.

kind: Feature type, one of "continuous", "categorical", or "binary".

Parameters:

name (str)
kind (Literal['continuous', 'categorical', 'binary'])

class pytyche.generators.core.CopulaConfig(features, correlation)

Bases: object

Gaussian copula feature sampler configuration.

Generates correlated mixed covariates via multivariate normal with configured correlation matrix and marginal transforms.

features: Ordered feature specifications. Length determines the dimension of the copula.

correlation: Square correlation matrix of shape (len(features), len(features)). Must be positive semi-definite.

Parameters:

features (tuple[FeatureSpec, ...])
correlation (ndarray)
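As a rough illustration of the Gaussian copula idea (not the library's implementation), the sketch below draws correlated standard normals, maps them to uniforms through the normal CDF, and then applies marginal transforms. The helper `copula_uniforms` and the two marginal transforms are assumptions chosen for the example.

```python
import math

import numpy as np


def copula_uniforms(correlation: np.ndarray, n: int, rng: np.random.Generator) -> np.ndarray:
    """Draw n rows of correlated Uniform(0, 1) values via a Gaussian copula."""
    dim = correlation.shape[0]
    # Correlated standard normals with the requested correlation matrix.
    z = rng.multivariate_normal(np.zeros(dim), correlation, size=n)
    # The standard normal CDF maps each column to Uniform(0, 1).
    phi = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
    return phi(z)


rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.6], [0.6, 1.0]])
u = copula_uniforms(corr, 1000, rng)

# Marginal transforms (illustrative choices, not the library's):
continuous = u[:, 0]                  # keep the uniform score as-is
binary = (u[:, 1] < 0.3).astype(int)  # Bernoulli(0.3) by thresholding the uniform
```

The dependence between columns survives the marginal transforms, which is what lets a single correlation matrix drive features of mixed kinds.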

class pytyche.generators.core.MixtureConfig(features, weights, cluster_params)

Bases: object

Mixture-of-populations feature sampler configuration.

Generates covariates by sampling latent cluster membership and then drawing from cluster-conditional feature distributions.

features: Feature specifications shared across all clusters.

weights: Per-cluster mixture weights. All must be strictly positive. Length must equal len(cluster_params).

cluster_params: Per-cluster parameter dicts keyed by feature name (or by arbitrary convention per stage implementation).

Parameters:

features (tuple[FeatureSpec, ...])
weights (tuple[float, ...])
cluster_params (tuple[dict[str, Any], ...])
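The two-step draw described for the mixture strategy might look like the following sketch. The helper `sample_mixture`, the single unit-variance normal feature, and the normalisation of the weights are all assumptions made for illustration; the source only states that weights are strictly positive.

```python
import numpy as np


def sample_mixture(weights, cluster_means, n, rng):
    """Draw cluster ids by weight, then one feature value from each row's cluster."""
    w = np.asarray(weights, dtype=float)
    # Step 1: latent cluster membership, one id per row (weights normalised here).
    cluster_id = rng.choice(len(w), size=n, p=w / w.sum())
    # Step 2: cluster-conditional draw, a unit-variance normal around the cluster mean.
    means = np.asarray(cluster_means, dtype=float)[cluster_id]
    x = rng.normal(loc=means, scale=1.0)
    return cluster_id, x


rng = np.random.default_rng(1)
cluster_id, x = sample_mixture((0.7, 0.3), (0.0, 5.0), 2000, rng)
```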

pytyche.generators.core.FeatureSamplerConfig = pytyche.generators.core.CopulaConfig | pytyche.generators.core.MixtureConfig: Discriminated union of feature sampler strategy configs.

class pytyche.generators.core.SurfaceConfig(fn)

Bases: object

HTE surface definition as a callable.

The callable receives the feature DataFrame X (one row per visitor) and returns a 1-D numpy array of per-visitor values (e.g. conversion probabilities, severity means).

fn: (X: pd.DataFrame) -> np.ndarray. Must return a 1-D array of length len(X).

Parameters:

fn (Callable[[DataFrame], ndarray])
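A callable satisfying this contract could be as small as the sketch below. The function `baseline_conversion`, the `age` column, and the coefficients are hypothetical; only the DataFrame-in, 1-D-array-out shape comes from the description above.

```python
import numpy as np
import pandas as pd


def baseline_conversion(X: pd.DataFrame) -> np.ndarray:
    """Hypothetical surface: conversion probability rising with `age`."""
    p = 0.05 + 0.001 * X["age"].to_numpy(dtype=float)
    # Keep the output a valid probability.
    return np.clip(p, 0.0, 1.0)


X = pd.DataFrame({"age": [20, 40, 60]})
values = baseline_conversion(X)  # one value per visitor row
```

Such a function would presumably be wrapped as SurfaceConfig(fn=baseline_conversion).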

class pytyche.generators.core.AssignmentConfig(treatment_allocation=0.5, *, treatment_probabilities=None)

Bases: object

Fixed (forced) treatment-assignment policy.

Configures the FIXED/forced assignment used for simple randomized designs and SBC — every visitor’s treatment is drawn from a feature-independent distribution (binary 50/50 or a uniform/fixed per-treatment vector). Policy-routed assignment — where a sequential experiment’s cell routes each visitor to a treatment based on their features — is NOT encoded here; it is supplied externally at observation time. This config is just one assignment policy (the forced/randomized one).

Binary (paired) form: set treatment_allocation (fraction in (0, 1)); leave treatment_probabilities as None.

Multi-treatment form: set treatment_probabilities to a length-K tuple of positive fractions summing to 1.0 (within 1e-9) — the fixed per-treatment assignment probabilities (uniform = balanced forced exploration). treatment_allocation is unused in this form but must still be in (0, 1).

treatment_allocation: Fraction of visitors assigned to treatment in the binary form. Must be in the open interval (0, 1). Default 0.5.

treatment_probabilities: Fixed per-treatment assignment probabilities for K≥2 (the forced/randomized policy). None selects the binary paired form. Length must be ≥ 2, every entry strictly in (0, 1), and entries must sum to 1.0 (±1e-9).

Parameters:

treatment_allocation (float)
treatment_probabilities (tuple[float, ...] | None)
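The documented constraints on treatment_probabilities translate directly into a check like the one below. The function name `check_treatment_probabilities` is hypothetical; this is a sketch of the stated rules, not the class's actual __post_init__.

```python
import math


def check_treatment_probabilities(probs):
    """Raise ValueError unless probs satisfies the documented constraints."""
    if len(probs) < 2:
        raise ValueError("need at least two treatment levels")
    if any(not (0.0 < p < 1.0) for p in probs):
        raise ValueError("every probability must lie strictly in (0, 1)")
    # Absolute tolerance of 1e-9 on the sum, as stated in the field description.
    if not math.isclose(sum(probs), 1.0, rel_tol=0.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1.0 within 1e-9")
    return tuple(probs)


# Uniform probabilities correspond to balanced forced exploration.
uniform = check_treatment_probabilities((0.25, 0.25, 0.25, 0.25))
```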

class pytyche.generators.core.MetricMode(metric_id, p0_surface=None, p1_surface=None, m0_surface=None, m1_surface=None, sigma0_surface=None, sigma1_surface=None, *, p_surfaces=None, m_surfaces=None, sigma_surfaces=None)

Bases: object

Metric-specific potential outcome surfaces.

K=2 (paired) form: populate the six paired surface fields (p0_surface, p1_surface, and for hurdle metrics m0/m1/sigma0/sigma1); leave p_surfaces, m_surfaces, and sigma_surfaces as None.

Multi-treatment form: leave all six paired surface fields as None and populate p_surfaces (length K) and, for hurdle metrics, m_surfaces and sigma_surfaces (also length K). K must be ≥ 3 — the K = 2 case uses the paired form (the truth-computation dispatch routes list-form surfaces through the multi-treatment path only at K ≥ 3).

Every surface is a function of the individual user’s attributes X. A SurfaceConfig carries the potential-outcome surface for ONE treatment level: p_surfaces[k](X) is the conversion potential outcome a user with features X WOULD show under treatment level k (level 0 = control, the prognostic baseline; levels 1…K−1 = treated). The surfaces are not properties of an assignment cohort — assignment (randomization, and in a sequential experiment a cell’s policy) merely selects which of a user’s K potential outcomes is observed.

For binary metrics ("conversion_rate"): only conversion probability surfaces are needed; severity/sigma surfaces must be None.

For hurdle metrics ("revenue_per_visitor"): conversion, severity, and sigma surfaces are all required.

metric_id: Canonical metric name. Must be a known metric.

p0_surface: Control (level-0) conversion potential-outcome surface (paired form).

p1_surface: Treated (level-1) conversion potential-outcome surface (paired form).

m0_surface: Control severity (AOV) surface. Hurdle paired form only.

m1_surface: Treated severity (AOV) surface. Hurdle paired form only.

sigma0_surface: Control LogNormal dispersion surface. Hurdle paired form only.

sigma1_surface: Treated LogNormal dispersion surface. Hurdle paired form only.

p_surfaces: Conversion potential-outcome surfaces, one per treatment level (length K, index 0 = control). Each is a function of user attributes X.

m_surfaces: Severity potential-outcome surfaces, one per treatment level (length K). Hurdle multi-treatment form.

sigma_surfaces: LogNormal dispersion surfaces, one per treatment level (length K). Hurdle multi-treatment form.

Parameters:

metric_id (str)
p0_surface (SurfaceConfig | None)
p1_surface (SurfaceConfig | None)
m0_surface (SurfaceConfig | None)
m1_surface (SurfaceConfig | None)
sigma0_surface (SurfaceConfig | None)
sigma1_surface (SurfaceConfig | None)
p_surfaces (list[SurfaceConfig] | None)
m_surfaces (list[SurfaceConfig] | None)
sigma_surfaces (list[SurfaceConfig] | None)

class pytyche.generators.core.V2GeneratorConfig(n_visitors, feature_sampler, metric_mode, assignment, seed, experiment_id='sim-v2', revenue_model=None)

Bases: object

Top-level configuration for the v2 core generator pipeline.

n_visitors: Total visitor count across all arms. Must be positive.

feature_sampler: Feature sampling strategy config (CopulaConfig or MixtureConfig).

metric_mode: Metric and potential-outcome surface definitions.

assignment: Treatment assignment parameters.

seed: Random seed for full pipeline reproducibility.

experiment_id: Identifier for the generated experiment. Default "sim-v2".

revenue_model: Revenue sampling strategy for hurdle metrics. None (default) uses LogNormal sampling (existing behavior). A CartRevenueConfig activates cart-based revenue sampling: per-category Bernoulli draws whose prices sum to converter revenue. Ignored for binary (conversion_rate) metrics.

Parameters:

n_visitors (int)
feature_sampler (CopulaConfig | MixtureConfig)
metric_mode (MetricMode)
assignment (AssignmentConfig)
seed (int)
experiment_id (str)
revenue_model (CartRevenueConfig | None)

pytyche.generators.core.sample_features(config, rng, n)

Sample a feature matrix from the given sampler config.

Parameters:

config (CopulaConfig | MixtureConfig) – Either a CopulaConfig (Gaussian copula strategy) or a MixtureConfig (mixture-of-populations strategy).
rng (Generator) – Explicit RNG instance — caller controls seeding for reproducibility.
n (int) – Number of rows to sample.

Returns:

Feature matrix of shape (n, len(config.features)). Column names and dtypes follow the feature specs in config.features. For MixtureConfig an additional cluster_id column is included, so the frame has one extra column (used internally for per-cluster surface dispatch; stripped before output by generate_v2_core).

Return type:

DataFrame

Raises:

TypeError – Raised for unsupported config types.
ValueError – Raised for unknown feature kinds within a config.

class pytyche.generators.core.TruthResult(cate_per_visitor, effect, effect_components, conv_per_visitor=None, aov_per_visitor=None, p0_values=None, p1_values=None, m0_values=None, m1_values=None, m0_effective=None, m1_effective=None, sigma0_values=None, sigma1_values=None, p_values=None, m_values=None, sigma_values=None)

Bases: object

Internal intermediate result from compute_potential_outcomes.

Holds per-visitor CATE arrays and population summary statistics before assembly into a CalibrationBundle. This is not part of the public contract: build_truth bridges it to the public CalibrationTruth, and build_bundle takes it as input when assembling the CalibrationBundle.

K-dispatch:

K=2 (paired form): populates the legacy paired *_values fields (p0_values, p1_values, and for hurdle m0/m1/sigma0/sigma1_values) plus cate_per_visitor; leaves p_values, m_values, sigma_values as None.

K≥3 (multi-treatment form): populates p_values (length K), and for hurdle m_values / sigma_values (also length K); sets cate_per_visitor=None and leaves all six legacy paired *_values fields as None.

cate_per_visitor: Per-visitor CATE, aligned 1:1 with input rows. Populated at K=2; None at K≥3.

effect: Population mean CATE (mean of cate_per_visitor.values at K=2; mean of the best-treatment contrast at K≥3, a provisional scalar).

effect_components: Named decomposition of the population effect.

Binary K=2: {"conv_effect": float}
Hurdle K=2: {"conv_effect": float, "aov_effect": float}
K≥3: {"treatment_1_effect": float, ..., "treatment_{K-1}_effect": float}

conv_per_visitor: Per-visitor conversion component. Only populated for hurdle K=2 metrics; None for binary or K≥3.

aov_per_visitor: Per-visitor AOV component. Only populated for hurdle K=2 metrics; None for binary or K≥3.

p0_values: Per-visitor control conversion probabilities. Populated by both binary and hurdle K=2 computations for use in outcome sampling.

p1_values: Per-visitor treatment conversion probabilities. K=2 only.

m0_values: Per-visitor control severity means (raw surface output). Hurdle K=2 only; None for binary or K≥3.

m1_values: Per-visitor treatment severity means (raw surface output). Hurdle K=2 only; None for binary or K≥3.

m0_effective: Per-visitor control expected revenue conditional on conversion. For lognormal, equals m0_values. For cart, this is the analytical expected cart revenue. None for binary or K≥3.

m1_effective: Per-visitor treatment expected revenue conditional on conversion. For lognormal, equals m1_values. For cart, this is the analytical expected cart revenue. None for binary or K≥3.

sigma0_values: Per-visitor control LogNormal dispersion. Hurdle K=2 only; None for binary or K≥3.

sigma1_values: Per-visitor treatment LogNormal dispersion. Hurdle K=2 only; None for binary or K≥3.

p_values: Per-visitor conversion potential outcomes, one array per treatment level (length K, index 0 = control). K≥3 only; None at K=2.

m_values: Per-visitor severity potential outcomes (length K). K≥3 hurdle only; None for K=2 or conversion_rate.

sigma_values: Per-visitor LogNormal dispersion (length K). K≥3 hurdle only; None for K=2 or conversion_rate.

Parameters:

cate_per_visitor (AlignedVisitorArray | None)
effect (float)
effect_components (dict[str, float])
conv_per_visitor (AlignedVisitorArray | None)
aov_per_visitor (AlignedVisitorArray | None)
p0_values (ndarray | None)
p1_values (ndarray | None)
m0_values (ndarray | None)
m1_values (ndarray | None)
m0_effective (ndarray | None)
m1_effective (ndarray | None)
sigma0_values (ndarray | None)
sigma1_values (ndarray | None)
p_values (list[ndarray] | None)
m_values (list[ndarray] | None)
sigma_values (list[ndarray] | None)

pytyche.generators.core.compute_potential_outcomes(features, metric_mode, revenue_model=None)

Compute per-visitor potential outcomes and derive truth arrays.

Dispatches based on metric_mode.n_treatments:

n_treatments == 2 (paired form): routes to _compute_binary_truth (binary metrics) or _compute_hurdle_truth (hurdle metrics).
n_treatments >= 3 (multi-treatment form): routes to _compute_truth_multi.

Parameters:

features (DataFrame) – Feature matrix, one row per visitor. Produced by sample_features.
metric_mode (MetricMode) – Metric and surface definitions. Determines which computation path to use (binary, hurdle, or multi-treatment).
revenue_model (CartRevenueConfig | None) – Cart revenue configuration. When provided, the hurdle decomposition uses analytical expected cart revenue instead of the raw severity surface values. Ignored for binary metrics. Not supported for K≥3.

Returns:

Per-visitor CATE, population effect, and decomposition components.

Return type:

TruthResult

Raises:

ValueError – If metric_mode.metric_id is not a supported metric.
NotImplementedError – If revenue_model is non-None and n_treatments >= 3.
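For a binary metric, the truth quantities can be illustrated with plain arrays. The sketch below assumes the standard definition of the CATE as the difference between treated and control potential outcomes, and a contrast against control for each treated level in the multi-treatment case; the numbers are made up, and the library's hurdle decomposition is not reproduced here.

```python
import numpy as np

# Hypothetical per-visitor conversion probabilities for 4 visitors.
p0 = np.array([0.10, 0.20, 0.05, 0.30])  # control potential outcome
p1 = np.array([0.12, 0.25, 0.05, 0.28])  # treated potential outcome

# Paired (K=2) form: per-visitor CATE and its population mean.
cate = p1 - p0
effect = float(cate.mean())

# Multi-treatment form: one contrast against control per treated level (K-1 arrays).
p_values = [p0, p1, np.array([0.15, 0.20, 0.10, 0.30])]
contrasts = [p_k - p_values[0] for p_k in p_values[1:]]
```

Because both potential outcomes are known for every visitor, the CATE is exact ground truth rather than an estimate, which is the point of a potential-outcomes generator.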

pytyche.generators.core.build_truth(truth_result, metric_mode)

Construct a CalibrationTruth from computed potential outcomes.

Bridges the internal TruthResult (pipeline intermediate) to the public CalibrationTruth contract.

K-dispatch:

K=2 (paired form): populates the legacy 1-D fields (cate_per_visitor, conv/aov_cate_per_visitor, p0/p1/m0/m1_per_visitor); leaves the three new list fields (contrast_cate_per_visitor, p_per_visitor, m_per_visitor) as None.

K≥3 (multi-treatment form): populates contrast_cate_per_visitor (length K−1), p_per_visitor (length K), and m_per_visitor (length K, or None for conversion_rate); leaves all legacy 1-D fields as None.

Parameters:

truth_result (TruthResult) – Output of compute_potential_outcomes — holds per-visitor CATE, population effect, and decomposition components.
metric_mode (MetricMode) – The metric configuration used to compute truth — provides metric_id for family derivation.

Returns:

Typed, frozen ground truth.

Return type:

CalibrationTruth

pytyche.generators.core.add_surfaces(s1, s2)

Return a surface whose output is the element-wise sum of two surfaces.

Parameters:

s1 (SurfaceConfig) – First input surface. Receives the same feature DataFrame X as s2.
s2 (SurfaceConfig) – Second input surface. Receives the same feature DataFrame X as s1.

Returns:

Surface computing s1(X) + s2(X).

Return type:

SurfaceConfig

pytyche.generators.core.multiply_surfaces(s1, s2)

Return a surface whose output is the element-wise product of two surfaces.

Parameters:

s1 (SurfaceConfig) – First input surface. Receives the same feature DataFrame X as s2.
s2 (SurfaceConfig) – Second input surface. Receives the same feature DataFrame X as s1.

Returns:

Surface computing s1(X) * s2(X).

Return type:

SurfaceConfig

pytyche.generators.core.sigmoid_surface(s)

Return a surface applying the sigmoid transform to an inner surface.

The sigmoid maps any real value to the open interval (0, 1):

sigmoid(z) = 1 / (1 + exp(-z))

Useful for composing linear or nonlinear terms into a valid probability surface without manual clipping.

Parameters:

s (SurfaceConfig) – Inner surface whose output is passed through sigmoid.

Returns:

Surface computing sigmoid(s(X)).

Return type:

SurfaceConfig
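The composition this enables, summing linear terms and then squashing them into a probability, can be sketched with plain functions. The names `intercept`, `slope_term`, and `conversion_probability` and the `score` column are hypothetical; the library's own combinators operate on SurfaceConfig objects rather than bare callables.

```python
import numpy as np
import pandas as pd


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def intercept(X: pd.DataFrame) -> np.ndarray:
    # Constant term, one value per visitor.
    return np.full(len(X), -2.0)


def slope_term(X: pd.DataFrame) -> np.ndarray:
    return 0.5 * X["score"].to_numpy(dtype=float)


def conversion_probability(X: pd.DataFrame) -> np.ndarray:
    # Sum the linear terms, then map the result into (0, 1).
    return sigmoid(intercept(X) + slope_term(X))


X = pd.DataFrame({"score": [-4.0, 0.0, 4.0]})
p = conversion_probability(X)
```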

pytyche.generators.core.threshold_surface(s, cutoff, above, below)

Return a step-function surface based on a threshold applied to an inner surface.

Produces a discontinuous surface suitable for stress-testing HTE estimators that assume smooth effect functions:

result[i] = above  if s(X)[i] > cutoff
            below  otherwise

Parameters:

s (SurfaceConfig) – Inner surface to threshold.
cutoff (float) – Decision boundary. Strict > comparison.
above (float) – Output value where s(X) > cutoff.
below (float) – Output value where s(X) <= cutoff.

Returns:

Step-function surface.

Return type:

SurfaceConfig
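The step rule above maps naturally onto np.where. The sketch below applies it to a plain array standing in for the inner surface's output s(X); the function name `threshold` is hypothetical.

```python
import numpy as np


def threshold(values, cutoff, above, below):
    # Strict comparison: a value exactly equal to the cutoff takes `below`.
    return np.where(np.asarray(values) > cutoff, above, below)


inner = np.array([0.2, 0.5, 0.8])  # stand-in for s(X)
stepped = threshold(inner, cutoff=0.5, above=0.10, below=0.02)
```

Note the middle element: it sits exactly on the cutoff and therefore receives the `below` value.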

pytyche.generators.core.assign_and_observe(features, truth_result, assignment, metric_id, rng, revenue_model=None, treatment_assignment=None)

Assign visitors to treatment levels and sample observed outcomes.

Dispatches to one of two paths based on the truth structure:

Paired/binary path (treatment_assignment is None AND truth_result.p_values is None): the existing K=2 binary/hurdle sampling unchanged — RNG draws are byte-identical to the original implementation. Returns a 2-element list [control, treatment].

Multi-treatment path (truth_result.p_values is not None, i.e. K≥3 truth): generalised to K treatment levels. Assignment is either:

Internal randomisation (treatment_assignment is None): uses assignment.treatment_probabilities to draw per-visitor treatment indices from a categorical distribution.
External hook (treatment_assignment supplied): the caller provides a per-visitor array of treatment indices in [0, K−1] (e.g. a sequential-experiment cell’s policy routing).

When revenue_model is a CartRevenueConfig, cart-based revenue sampling is used (K=2 only). Supplying a revenue_model for K≥3 raises NotImplementedError.

Parameters:

features (DataFrame) – Feature matrix produced by sample_features, one row per visitor.
truth_result (TruthResult) – Output of compute_potential_outcomes. K=2: carries p0_values and p1_values (and hurdle arrays). K≥3: carries p_values (and m_values/sigma_values for hurdle metrics).
assignment (AssignmentConfig) – Treatment assignment parameters. Paired path uses treatment_allocation; multi-treatment internal path uses treatment_probabilities.
metric_id (str) – Canonical metric identifier — determines revenue sampling path.
rng (Generator) – Caller-controlled RNG for reproducible assignment and sampling.
revenue_model (CartRevenueConfig | None) – Revenue sampling strategy for hurdle metrics. None (default) uses LogNormal sampling. A CartRevenueConfig activates cart-based sampling (K=2 only). Ignored for binary metrics.
treatment_assignment (ndarray | None) – External per-visitor treatment-index array (ints in [0, K−1]). When supplied, bypasses internal randomisation and drives group membership exactly — suitable for the sequential-experiment loop where a cell’s policy routes visitors by features. Requires multi-treatment truth (truth_result.p_values non-None); supplying this with K=2 paired truth raises ValueError.

Returns:

Length K. Index 0 = control (name="control"); indices 1…K−1 are name="treatment_k" for K≥3, or name="treatment" for K=2. All VariantData contain observed columns and feature columns only (no truth fields).

Return type:

list[VariantData]

Raises:

ValueError – If truth_result is missing required potential outcome arrays, if treatment_assignment is supplied with K=2 paired truth, or if treatment_assignment has invalid length or out-of-range indices.
NotImplementedError – If revenue_model is non-None and K≥3 truth is present.
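The internal-randomisation path for a conversion metric can be pictured roughly as follows: draw a treatment index per visitor, pick that visitor's potential outcome for the assigned level, and sample one Bernoulli conversion. This is a sketch under assumed constant probabilities, not the library's sampling code, and it makes no claim about the order of RNG draws in the real implementation.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Hypothetical potential-outcome probabilities, one array per treatment level.
p_values = [np.full(n, 0.10), np.full(n, 0.20), np.full(n, 0.30)]
treatment_probabilities = (0.5, 0.25, 0.25)

# Internal randomisation: feature-independent categorical draw per visitor.
assignment = rng.choice(len(p_values), size=n, p=treatment_probabilities)

# For each visitor, select the potential outcome of the assigned level.
p_assigned = np.stack(p_values, axis=1)[np.arange(n), assignment]

# Observe one Bernoulli conversion per visitor.
converted = rng.random(n) < p_assigned

# Group sizes per treatment level, index 0 = control.
group_sizes = np.bincount(assignment, minlength=len(p_values))
```

An externally supplied treatment_assignment would simply replace the `rng.choice` line with the caller's index array; everything after it stays the same.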

pytyche.generators.core.build_bundle(variants, truth_result, metric_mode, experiment_id)

Assemble a CalibrationBundle from variant list and truth.

Accepts a list of K VariantData (K=2 for paired designs, K≥3 for multi-treatment). Stamps the canonical experiment_id onto each variant, validates the observed data schema, and performs an alignment check appropriate for the truth form:

K=2 (truth.cate_per_visitor non-None): standard CATE alignment.
K≥3 (truth.cate_per_visitor is None): aligns truth.p_per_visitor[0] against the total observed visitor count.

Fail-closed: raises on any validation or alignment violation.

Parameters:

variants (list[VariantData]) – List of VariantData, one per treatment level. Index 0 = control, indices 1…K−1 = treated levels.
truth_result (TruthResult) – Internal truth intermediate from compute_potential_outcomes.
metric_mode (MetricMode) – Metric configuration — provides metric_id for family derivation.
experiment_id (str) – Canonical experiment identifier stamped onto observed data.

Returns:

(observed, truth) — observed is validated and truth-free.

Return type:

CalibrationBundle

Raises:

SchemaViolation – If observed data violates the visitor schema contract.
AlignmentViolation – If the truth array length doesn’t match total observed visitors.
ValueError – If alignment check fails at the n_visitors level.
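The alignment check amounts to comparing the truth array length with the total number of observed visitors across variants. A minimal fail-closed sketch, using a hypothetical `AlignmentError` in place of the library's AlignmentViolation:

```python
class AlignmentError(ValueError):
    """Hypothetical stand-in for the library's AlignmentViolation."""


def check_alignment(variant_sizes, truth_length):
    """Return the total visitor count, or raise if truth and observed disagree."""
    total = sum(variant_sizes)
    if total != truth_length:
        raise AlignmentError(
            f"truth has {truth_length} rows but variants hold {total} visitors"
        )
    return total


total = check_alignment([480, 520], 1000)
```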

pytyche.generators.core.generate_v2_core(config)

Generate a CalibrationBundle via the v2 potential-outcomes pipeline.

Pipeline stages:

sample_features — draw correlated mixed covariates X.
compute_potential_outcomes — derive per-visitor truth arrays.
assign_and_observe — assign treatment, sample observed outcomes.
build_bundle — assemble CalibrationBundle(observed, truth).

Parameters:

config (V2GeneratorConfig) – Fully specified v2 generator configuration.

Returns:

(observed, truth): observed is truth-free; truth contains per-visitor CATE aligned with concatenated visitor rows.

Return type:

CalibrationBundle
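To make the four stages and the role of the seed concrete, here is a miniature stand-in pipeline built only on NumPy. The function `tiny_pipeline`, its logistic surfaces, and the dict-based outputs are all assumptions for illustration; it mirrors the shape of the pipeline, not the library's code.

```python
import numpy as np


def tiny_pipeline(seed: int, n: int = 1000):
    rng = np.random.default_rng(seed)
    # 1. Features: one standard-normal covariate.
    x = rng.normal(size=n)
    # 2. Potential outcomes: hypothetical logistic surfaces per arm.
    p0 = 1.0 / (1.0 + np.exp(-(-2.0 + 0.5 * x)))
    p1 = 1.0 / (1.0 + np.exp(-(-1.8 + 0.5 * x)))
    cate = p1 - p0
    # 3. Assignment and observation: 50/50 split, Bernoulli outcome.
    treated = rng.random(n) < 0.5
    converted = rng.random(n) < np.where(treated, p1, p0)
    # 4. Bundle: observed data kept separate from ground truth.
    observed = {"treated": treated, "converted": converted}
    truth = {"cate_per_visitor": cate, "effect": float(cate.mean())}
    return observed, truth


obs_a, truth_a = tiny_pipeline(seed=42)
obs_b, truth_b = tiny_pipeline(seed=42)
```

Running the sketch twice with the same seed yields identical observed and truth data, since every random draw flows from one generator created from that seed.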