Marginals
- class synthyverse.evaluation.fidelity.Marginals(discrete_features=[], n_bins_numerical=20)
Bases: object
Registry name: marginals
Per-column distributional distance between real and synthetic marginals.
Computes distance metrics for each column independently and returns the average distances over numerical and categorical features separately. Numerical distance functions: Wasserstein (wsd), Jensen-Shannon divergence (jsd), Kolmogorov-Smirnov statistic (ks), Total Variation distance (tvd), and Kullback-Leibler divergence (kld). Categorical distance functions: Jensen-Shannon divergence (jsd), Total Variation distance (tvd), and Kullback-Leibler divergence (kld). For histogram-based metrics (jsd, tvd, kld) on numerical features, values are discretized into equal-width bins before comparison.
Lower scores indicate better fidelity to the real marginals.
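The histogram-based comparison described above can be sketched for a single numerical column as follows. This is a minimal standard-library illustration, not the library's implementation: the function names, the choice to take bin edges from the pooled min/max of both samples, the natural-log base, the direction of the KL divergence (real against synthetic), and the smoothing constant `eps` are all assumptions.

```python
import math


def equal_width_histogram(values, lo, hi, n_bins):
    """Relative frequencies of `values` over n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if width == 0:
            idx = 0  # degenerate case: all values identical
        else:
            # Values equal to `hi` are clamped into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def tvd(p, q):
    """Total Variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kld(p, q, eps=1e-12):
    """KL divergence KL(p || q); eps is an assumed constant that avoids log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (not its square root) with natural log."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


def numerical_histogram_distances(real, syn, n_bins=20):
    """Bin both samples on shared equal-width edges, then compare the histograms."""
    if n_bins < 2:
        raise ValueError("n_bins must be >= 2")
    # Assumption: bin edges span the pooled range of both samples.
    lo = min(min(real), min(syn))
    hi = max(max(real), max(syn))
    p = equal_width_histogram(real, lo, hi, n_bins)
    q = equal_width_histogram(syn, lo, hi, n_bins)
    return {"jsd": jsd(p, q), "tvd": tvd(p, q), "kld": kld(p, q)}


# Toy columns for illustration only.
scores = numerical_histogram_distances(
    [0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 3.0], n_bins=2
)
```

Sharing one set of bin edges between the two samples keeps the histograms comparable bin by bin. Identical samples give a distance of zero under all three metrics.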
- Parameters:
discrete_features (list) – List of discrete/categorical feature names. Default: [].
n_bins_numerical (int) – Number of equal-width bins used when discretizing numerical features for jsd/tvd/kld. Must be >= 2. Default: 20.
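Columns listed in discrete_features need no binning, since their category frequencies can be compared directly. The sketch below illustrates this with the Total Variation distance for one categorical column. The helper names and the choice to compare over the union of categories seen in either sample are assumptions made for illustration, not the library's code.

```python
from collections import Counter


def category_frequencies(values, categories):
    """Relative frequency of each category, in the order given by `categories`."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def categorical_tvd(real, syn):
    """Total Variation distance between the category frequencies of two columns."""
    # Assumption: levels missing from one sample are given frequency 0.
    categories = sorted(set(real) | set(syn))
    p = category_frequencies(real, categories)
    q = category_frequencies(syn, categories)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy columns for illustration only.
real_col = ["a", "a", "b", "c"]
syn_col = ["a", "b", "b", "b"]
distance = categorical_tvd(real_col, syn_col)
```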
Example
>>> import pandas as pd
>>> from synthyverse.evaluation import Marginals
>>>
>>> # Prepare data
>>> X_real = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = Marginals(
...     discrete_features=discrete_features,
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_real, X_syn)
- evaluate(X_train, X_syn)
Evaluate synthetic data by comparing marginal distributions.
- Parameters:
X_train (DataFrame) – Real training data as a pandas DataFrame.
X_syn (DataFrame) – Synthetic data as a pandas DataFrame.
- Returns:
- Dictionary with keys:
"marginals.num_<distance>": Mean distance over numerical features
"marginals.cat_<distance>": Mean distance over categorical features
- Return type:
dict
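Because the returned keys follow the "marginals.num_<distance>" and "marginals.cat_<distance>" pattern, the scores can be regrouped by feature type for reporting. The helper below is a hypothetical convenience, not part of synthyverse, and the values in the example dictionary are placeholders rather than real output.

```python
def split_marginal_scores(results):
    """Group 'marginals.num_<distance>' and 'marginals.cat_<distance>' by feature type."""
    grouped = {"num": {}, "cat": {}}
    for key, value in results.items():
        prefix, _, name = key.partition(".")
        if prefix != "marginals":
            continue  # ignore entries from other metrics
        kind, _, distance = name.partition("_")
        if kind in grouped:
            grouped[kind][distance] = value
    return grouped


# Placeholder values for illustration only, not real evaluation output.
example_results = {
    "marginals.num_wsd": 0.0,
    "marginals.num_ks": 0.0,
    "marginals.cat_tvd": 0.0,
}
grouped = split_marginal_scores(example_results)
```

Here `grouped["num"]` holds the numerical distances keyed by their short names (such as wsd and ks) and `grouped["cat"]` the categorical ones.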