Marginals

class synthyverse.evaluation.fidelity.Marginals(discrete_features=[], n_bins_numerical=20)

Bases: object

Registry name: marginals

Per-column distributional distance between real and synthetic marginals.

Computes distance metrics for each column independently and returns the mean distance over numerical features and over categorical features separately. Numerical distance functions: Wasserstein distance (wsd), Jensen-Shannon divergence (jsd), Kolmogorov-Smirnov statistic (ks), Total Variation distance (tvd), and Kullback-Leibler divergence (kld). Categorical distance functions: Jensen-Shannon divergence (jsd), Total Variation distance (tvd), and Kullback-Leibler divergence (kld). For the histogram-based metrics (jsd, tvd, kld) on numerical features, values are discretized into equal-width bins before comparison.

Lower scores indicate better fidelity to the real marginals.
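
The numerical distances described above can be sketched for a single column with NumPy and SciPy. This is an illustration, not synthyverse's own implementation: the function name numerical_distances, the choice to place the bins over the pooled range of both samples, the base-2 logarithm for jsd, and the eps smoothing for kld are all assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp, wasserstein_distance


def numerical_distances(real, syn, n_bins=20, eps=1e-12):
    """Illustrative per-column distances (hypothetical helper, not library code)."""
    real = np.asarray(real, dtype=float)
    syn = np.asarray(syn, dtype=float)

    # Sample-based metrics operate on the raw values.
    out = {
        "wsd": float(wasserstein_distance(real, syn)),
        "ks": float(ks_2samp(real, syn).statistic),
    }

    # Histogram-based metrics: equal-width bins over the pooled range (assumption).
    lo = min(real.min(), syn.min())
    hi = max(real.max(), syn.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(real, bins=edges)[0] / real.size
    q = np.histogram(syn, bins=edges)[0] / syn.size

    out["tvd"] = float(0.5 * np.abs(p - q).sum())
    # SciPy returns the Jensen-Shannon *distance* (a square root); square it
    # to obtain the divergence.
    out["jsd"] = float(jensenshannon(p, q, base=2) ** 2)
    # Smooth both histograms so KL stays finite when a bin is empty (assumption).
    p_s = (p + eps) / (p + eps).sum()
    q_s = (q + eps) / (q + eps).sum()
    out["kld"] = float(np.sum(p_s * np.log(p_s / q_s)))
    return out
```

With these conventions tvd and jsd are bounded by 1, while wsd is expressed in the units of the column and kld is unbounded, so the metrics are not directly comparable with each other.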

Parameters:
  • discrete_features (list) – List of discrete/categorical feature names. Default: [].

  • n_bins_numerical (int) – Number of equal-width bins used when discretizing numerical features for jsd/tvd/kld. Must be >= 2. Default: 20.
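
For columns listed in discrete_features no binning is needed, since category frequencies can be compared directly. The sketch below shows one way to compute the mean Total Variation distance over such columns with pandas. The helpers categorical_tvd and mean_categorical_tvd are hypothetical names, and aligning on the union of observed categories is an assumption about how unseen categories are handled.

```python
import pandas as pd


def categorical_tvd(real: pd.Series, syn: pd.Series) -> float:
    """Total Variation distance between two categorical columns (illustrative)."""
    # Relative frequencies of each category.
    p = real.value_counts(normalize=True)
    q = syn.value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side gets 0.
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())


def mean_categorical_tvd(X_real, X_syn, discrete_features):
    """Average the per-column distance over the listed discrete features."""
    scores = [categorical_tvd(X_real[c], X_syn[c]) for c in discrete_features]
    return sum(scores) / len(scores) if scores else float("nan")
```

A column missing from discrete_features would not go through this path, so listing every categorical column explicitly matters for which metrics it contributes to.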

Example

>>> import pandas as pd
>>> from synthyverse.evaluation import Marginals
>>>
>>> # Prepare data
>>> X_real = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = Marginals(
...     discrete_features=discrete_features,
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_real, X_syn)

evaluate(X_train, X_syn)

Evaluate synthetic data by comparing marginal distributions.

Parameters:
  • X_train (DataFrame) – Real training data as a pandas DataFrame.

  • X_syn (DataFrame) – Synthetic data as a pandas DataFrame.

Returns:

Dictionary with keys:
  • "marginals.num_<distance>": Mean distance over numerical features, where <distance> is one of wsd, jsd, ks, tvd, kld.

  • "marginals.cat_<distance>": Mean distance over categorical features, where <distance> is one of jsd, tvd, kld.

Return type:

dict
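
Because the returned keys follow the "marginals.<kind>_<distance>" pattern, the dictionary can be regrouped by feature kind for reporting. The helper split_results below is a hypothetical convenience function, not part of synthyverse, and the dictionary passed to it in the usage lines holds placeholder values that only illustrate the key layout.

```python
def split_results(results: dict) -> dict:
    """Group 'marginals.<kind>_<distance>' entries by feature kind (illustrative)."""
    grouped = {"num": {}, "cat": {}}
    for key, value in results.items():
        # Drop the 'marginals.' prefix, then split 'num_wsd' into ('num', 'wsd').
        name = key.split(".", 1)[-1]
        kind, _, distance = name.partition("_")
        if kind in grouped:
            grouped[kind][distance] = value
    return grouped


# Placeholder values, not real metric output.
placeholder = {"marginals.num_wsd": 0.0, "marginals.cat_tvd": 0.0}
grouped = split_results(placeholder)
```

Keys that do not start with a num or cat kind are skipped, so the helper can be applied to a larger results dictionary that also contains other metrics.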