Fidelity Metrics
- class synthyverse.evaluation.fidelity.ClassifierTest(X_val=None, discrete_features=None, random_state=0, model_name='xgboost', clf_params=None, tune=False, tuning_trials=32)
Bases: object
AUC score of a classifier that distinguishes synthetic from real data.
Lower scores indicate better-quality synthetic data (harder to distinguish from real); an AUC near 0.5 means the classifier performs no better than chance.
- Parameters:
X_val (pd.DataFrame, optional) – Validation data for hyperparameter tuning. Default: None.
discrete_features (list, optional) – List of discrete/categorical feature names. Default: None.
random_state (int) – Random seed for reproducibility. Default: 0.
model_name (str) – Classifier name. Use “xgboost” for native XGBoost, or any sklearn classifier class name discoverable via sklearn.utils.discovery.all_estimators.
clf_params (dict, optional) – Classifier parameters passed to the selected estimator. Default: None.
tune (bool) – Whether to tune hyperparameters. Default: False.
tuning_trials (int) – Number of Optuna trials for hyperparameter tuning. Default: 32.
Example
>>> import pandas as pd
>>> from synthyverse.evaluation import ClassifierTest
>>>
>>> # Prepare data
>>> X_train = pd.DataFrame(...)
>>> X_test = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> X_val = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = ClassifierTest(
...     X_val=X_val,
...     discrete_features=discrete_features,
...     tune=True,
...     random_state=42
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_train, X_test, X_syn)
- evaluate(train, test, sd)
Evaluate synthetic data using classifier test.
- Parameters:
train (DataFrame) – Training data as a pandas DataFrame.
test (DataFrame) – Test data as a pandas DataFrame.
sd (DataFrame) – Synthetic data as a pandas DataFrame.
- Returns:
Dictionary with “classifiertest.auc” key and AUC score value.
- Return type:
dict
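The idea behind a classifier test can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the helper name `classifier_test_auc`, the random forest model, and the internal 70/30 split are all assumptions made for illustration, whereas `ClassifierTest.evaluate` takes explicit train and test sets.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def classifier_test_auc(real, syn, random_state=0):
    """Hypothetical helper: AUC of a real-vs-synthetic classifier."""
    # Label real rows 0 and synthetic rows 1, then stack them.
    X = pd.concat([real, syn], ignore_index=True)
    y = np.concatenate([np.zeros(len(real)), np.ones(len(syn))])
    # Hold out part of the stacked data to score the classifier (assumed split).
    X_fit, X_eval, y_fit, y_eval = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    clf.fit(X_fit, y_fit)
    # AUC of the predicted probability of the "synthetic" class.
    return roc_auc_score(y_eval, clf.predict_proba(X_eval)[:, 1])


rng = np.random.default_rng(0)
cols = ["a", "b", "c"]
real = pd.DataFrame(rng.normal(0, 1, size=(500, 3)), columns=cols)
close = pd.DataFrame(rng.normal(0, 1, size=(500, 3)), columns=cols)
shifted = pd.DataFrame(rng.normal(3, 1, size=(500, 3)), columns=cols)

auc_close = classifier_test_auc(real, close)      # same distribution
auc_shifted = classifier_test_auc(real, shifted)  # clearly different
```

In this toy setup, data drawn from the same distribution should yield an AUC close to chance, while the shifted sample should be separated almost perfectly.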
- class synthyverse.evaluation.fidelity.AlphaPrecisionBetaRecallAuthenticity(discrete_features=[])
Bases: object
Alpha-Precision, Beta-Recall, Authenticity score.
Paper: “How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models” by Alaa et al. (2022).
- Parameters:
discrete_features (list) – List of discrete/categorical feature names. Default: [].
Example
>>> import pandas as pd
>>> from synthyverse.evaluation import AlphaPrecisionBetaRecallAuthenticity
>>>
>>> # Prepare data
>>> X_real = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = AlphaPrecisionBetaRecallAuthenticity(
...     discrete_features=discrete_features
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_real, X_syn)
- evaluate(rd, sd)
Evaluate synthetic data using alpha-precision, beta-recall, and authenticity.
- Parameters:
rd (DataFrame) – Real data as a pandas DataFrame.
sd (DataFrame) – Synthetic data as a pandas DataFrame.
- Returns:
Dictionary with keys:
“alphaprecision.naive.score”: Alpha-precision score
“betacoverage.naive.score”: Beta-coverage score
“authenticity.naive.score”: Authenticity score
- Return type:
dict
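Of the three scores, authenticity is the simplest to illustrate. The sketch below follows one reading of the idea in the cited paper, computed directly in raw feature space: a synthetic row counts as authentic when it lies farther from its nearest real row than that real row lies from its own nearest real neighbour. The helper name `authenticity_sketch` is hypothetical, numeric input is assumed, and the library's implementation may differ in preprocessing and in the exact comparison.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def authenticity_sketch(real, syn):
    """Hypothetical helper: fraction of synthetic rows that are not near-copies."""
    real = np.asarray(real, dtype=float)
    syn = np.asarray(syn, dtype=float)

    nn_real = NearestNeighbors(n_neighbors=2).fit(real)
    # Distance from each real row to its nearest *other* real row
    # (column 0 is the row itself at distance 0).
    real_to_real = nn_real.kneighbors(real)[0][:, 1]

    # Distance from each synthetic row to its nearest real row, and that row's index.
    dist, idx = nn_real.kneighbors(syn, n_neighbors=1)
    syn_to_real = dist[:, 0]
    nearest_real = idx[:, 0]

    # Authentic: the synthetic row is farther from the real row than
    # the real row's own nearest real neighbour is.
    return float(np.mean(syn_to_real > real_to_real[nearest_real]))


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))
score_copies = authenticity_sketch(real, real.copy())  # exact copies of real rows
score_far = authenticity_sketch(real, real + 100.0)    # rows far from all real data
```

Exact copies of real rows score 0 under this definition, while rows far from every real row score 1. A high authenticity score alone therefore says nothing about fidelity, which is why it is reported alongside alpha-precision and beta-recall.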
- class synthyverse.evaluation.fidelity.ShapeTrend(discrete_features=[])
Bases: object
Column Shapes and Column Pair Trends from the SDMetrics library (https://docs.sdv.dev/sdmetrics/).
Indicates quality of marginal distributions and correlations in synthetic data, respectively.
- Parameters:
discrete_features (list) – List of discrete/categorical feature names. Default: [].
Example
>>> import pandas as pd
>>> from synthyverse.evaluation import ShapeTrend
>>>
>>> # Prepare data
>>> X_real = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = ShapeTrend(discrete_features=discrete_features)
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_real, X_syn)
- evaluate(rd, sd)
Evaluate synthetic data using SDMetrics shape and trend scores.
- Parameters:
rd (DataFrame) – Real data as a pandas DataFrame.
sd (DataFrame) – Synthetic data as a pandas DataFrame.
- Returns:
Dictionary with keys:
“shapetrend.shape”: Column shapes score
“shapetrend.trend”: Column pair trends score
- Return type:
dict
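To give a feel for what a column-shape score measures, the sketch below computes the two per-column quantities that SDMetrics uses for Column Shapes: the KS complement for numerical columns and the total variation complement for categorical columns. It is written with SciPy and pandas rather than SDMetrics itself; the helper names and the toy data are assumptions, and how `ShapeTrend` aggregates the per-column values is not shown here.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_complement(real_col, syn_col):
    """1 minus the two-sample KS statistic; 1.0 means identical empirical CDFs."""
    return 1.0 - ks_2samp(real_col, syn_col).statistic


def tv_complement(real_col, syn_col):
    """1 minus the total variation distance between category frequencies."""
    p = real_col.value_counts(normalize=True)
    q = syn_col.value_counts(normalize=True)
    # Align both frequency vectors on the union of observed categories.
    cats = p.index.union(q.index)
    p = p.reindex(cats, fill_value=0.0)
    q = q.reindex(cats, fill_value=0.0)
    return 1.0 - 0.5 * float(np.abs(p - q).sum())


real = pd.DataFrame({"age": [20, 30, 40, 50], "city": ["a", "a", "b", "b"]})
syn = pd.DataFrame({"age": [20, 30, 40, 50], "city": ["a", "a", "a", "a"]})

age_score = ks_complement(real["age"], syn["age"])     # identical numeric column
city_score = tv_complement(real["city"], syn["city"])  # category "b" is missing
```

Both scores lie in [0, 1], with higher values meaning closer marginals. Note that this is the opposite direction from the distance-based metrics, where lower is better.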
- class synthyverse.evaluation.fidelity.Marginals(discrete_features=[], numerical_distance='wsd', categorical_distance='jsd', n_bins_numerical=30)
Bases: object
Per-column distributional distance between real and synthetic marginals.
Computes a distance metric for each column independently and returns the average distance over numerical and categorical features separately. Supported distance functions: Wasserstein (wsd), Jensen-Shannon divergence (jsd), Kolmogorov-Smirnov statistic (ks), and Total Variation distance (tvd). For histogram-based metrics (jsd, tvd) on numerical features, values are discretized into equal-width bins before comparison.
Lower scores indicate better fidelity to the real marginals.
- Parameters:
discrete_features (list) – List of discrete/categorical feature names. Default: [].
numerical_distance (str) – Distance metric for numerical features. One of “wsd”, “jsd”, “ks”, or “tvd”. Default: “wsd”.
categorical_distance (str) – Distance metric for categorical features. One of “jsd”, “tvd”, “wsd”, or “ks”. Default: “jsd”.
n_bins_numerical (int) – Number of equal-width bins used when discretizing numerical features for jsd/tvd. Must be >= 2. Default: 30.
Example
>>> import pandas as pd
>>> from synthyverse.evaluation import Marginals
>>>
>>> # Prepare data
>>> X_real = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = Marginals(
...     discrete_features=discrete_features,
...     numerical_distance="wsd",
...     categorical_distance="jsd",
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_real, X_syn)
- evaluate(rd, sd)
Evaluate synthetic data by comparing marginal distributions.
- Parameters:
rd (DataFrame) – Real data as a pandas DataFrame.
sd (DataFrame) – Synthetic data as a pandas DataFrame.
- Returns:
Dictionary with keys:
“marginals.<numerical_distance>”: Mean distance over numerical features
“marginals.<categorical_distance>”: Mean distance over categorical features
- Return type:
dict
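The per-column distances described for this class can be sketched with SciPy. The example below computes a Wasserstein distance on raw values and a Jensen-Shannon divergence after equal-width binning over the combined range of both samples. The helper name `binned_jsd`, the choice of a shared bin range, and the use of base-2 logarithms are assumptions; the library may bin or normalize differently.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance


def binned_jsd(real_col, syn_col, n_bins=30):
    """Hypothetical helper: JS divergence between equal-width histograms."""
    lo = min(real_col.min(), syn_col.min())
    hi = max(real_col.max(), syn_col.max())
    # Shared equal-width bin edges so both histograms are comparable.
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(syn_col, bins=edges)
    # SciPy returns the JS *distance* (square root of the divergence)
    # and normalizes the count vectors itself.
    return float(jensenshannon(p, q, base=2) ** 2)


real_col = np.array([0.0, 1.0, 2.0])
syn_col = real_col + 5.0  # same shape, shifted by 5

wsd = wasserstein_distance(real_col, syn_col)
jsd_shifted = binned_jsd(real_col, syn_col, n_bins=10)
jsd_same = binned_jsd(real_col, real_col, n_bins=10)
```

The two distances react differently to the same shift: the Wasserstein distance grows with the size of the shift, whereas the binned Jensen-Shannon divergence saturates at its maximum as soon as the histograms stop overlapping.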
- class synthyverse.evaluation.fidelity.Correlations(discrete_features=[], numerical_correlation='pearson')
Bases: object
Pairwise correlation matrix difference between real and synthetic data.
Builds a full correlation matrix for both real and synthetic data and returns the L2 norm of their absolute difference. Correlation type is chosen automatically per feature pair: Spearman/Pearson for numerical-numerical, Cramer’s V for categorical-categorical, and the correlation ratio (eta-squared) for mixed pairs.
Lower scores indicate better preservation of feature dependencies.
- Parameters:
discrete_features (list) – List of discrete/categorical feature names. Default: [].
numerical_correlation (str) – Correlation method for numerical-numerical pairs. One of “spearman” or “pearson”. Default: “pearson”.
Example
>>> import pandas as pd
>>> from synthyverse.evaluation import Correlations
>>>
>>> # Prepare data
>>> X_real = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = Correlations(
...     discrete_features=discrete_features,
...     numerical_correlation="spearman",
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_real, X_syn)
- evaluate(rd, sd)
Evaluate synthetic data by comparing pairwise correlation matrices.
- Parameters:
rd (DataFrame) – Real data as a pandas DataFrame.
sd (DataFrame) – Synthetic data as a pandas DataFrame.
- Returns:
Dictionary with key:
“correlations.l2”: L2 norm of the absolute difference between the real and synthetic correlation matrices
- Return type:
dict
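A rough sketch of two of the ingredients follows: the matrix-difference norm for numerical columns and Cramér's V for a categorical pair. The helper names are hypothetical, the "L2 norm" is assumed here to be the Frobenius norm (NumPy's default for matrices), and the mixed numerical-categorical case is omitted, so this is an illustration of the idea rather than the library's computation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def pearson_matrix_distance(real, syn):
    """Frobenius norm of the absolute difference of Pearson correlation matrices."""
    diff = real.corr(method="pearson") - syn.corr(method="pearson")
    return float(np.linalg.norm(diff.abs().to_numpy()))


def cramers_v(a, b):
    """Cramér's V between two categorical Series (no bias correction)."""
    table = pd.crosstab(a, b).to_numpy()
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


x = np.arange(10, dtype=float)
real = pd.DataFrame({"x": x, "y": 2 * x})   # perfectly positively correlated
syn = pd.DataFrame({"x": x, "y": -2 * x})   # perfectly negatively correlated

dist_flipped = pearson_matrix_distance(real, syn)
dist_same = pearson_matrix_distance(real, real)

cat = pd.Series(["u", "v"] * 10)
v_identical = cramers_v(cat, cat)           # a column is fully associated with itself
```

With the correlation sign flipped, each off-diagonal entry differs by 2, which gives a Frobenius norm of sqrt(8). Because the norm sums over all entries, its scale grows with the number of features, so scores are most meaningful when compared on the same set of columns.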