Utility Metrics

class synthyverse.evaluation.utility.MLE(X_val=None, target_column='target', discrete_features=None, random_state=0, train_set='synthetic', model_name='xgboost', model_params=None, tune=False, tuning_trials=32)

Bases: object

Machine Learning Efficacy (MLE) computed with configurable ML models.

Measures how well synthetic data can be used for downstream machine learning tasks compared to real data.

Parameters:
  • X_val (pd.DataFrame, optional) – Validation data for hyperparameter tuning. Default: None.

  • target_column (str) – Name of the target column. Default: “target”.

  • discrete_features (list, optional) – List of discrete/categorical feature names. Default: None, treated as an empty list.

  • random_state (int) – Random seed for reproducibility. Default: 0.

  • train_set (str) – Which dataset to train on: “synthetic” for TSTR (train on synthetic, test on real) or “real” for TRTS (train on real, test on synthetic). Default: “synthetic”.

  • model_name (str) – Estimator name. Use “xgboost” for native XGBoost, or any sklearn estimator class name discoverable via sklearn.utils.discovery.all_estimators. Default: “xgboost”.

  • model_params (dict, optional) – Model parameters passed to the selected estimator. Default: None.

  • tune (bool) – Whether to tune hyperparameters. Default: False.

  • tuning_trials (int) – Number of Optuna trials for hyperparameter tuning. Default: 32.

Example

>>> import pandas as pd
>>> from synthyverse.evaluation import MLE
>>>
>>> # Prepare data
>>> X_train = pd.DataFrame(...)
>>> X_test = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> X_val = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = MLE(
...     X_val=X_val,
...     target_column="target",
...     discrete_features=discrete_features,
...     train_set="synthetic",
...     tune=True,
...     random_state=42
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_train, X_test, X_syn)

evaluate(train, test, sd)

Evaluate synthetic data utility using machine learning efficacy.

Parameters:
  • train (DataFrame) – Real training data as a pandas DataFrame.

  • test (DataFrame) – Real test data as a pandas DataFrame.

  • sd (DataFrame) – Synthetic data as a pandas DataFrame.

Returns:

Dictionary with MLE metric scores, including both synthetic-to-real and real-to-real baseline scores.

Return type:

dict
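
To make the two kinds of score concrete, here is a toy, standard-library-only sketch of the idea: fit the same model once on synthetic data and once on real training data, then score both on the real test set. It assumes a one-feature nearest-centroid classifier as a stand-in for the configurable estimator, uses made-up data, and the dictionary keys "tstr" and "trtr" are hypothetical. The actual keys returned by evaluate are not documented here.

```python
from statistics import mean


def fit_centroids(rows):
    """Fit a nearest-centroid classifier on (feature, label) pairs."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid carries the true label."""
    hits = sum(
        min(centroids, key=lambda label: abs(x - centroids[label])) == y
        for x, y in rows
    )
    return hits / len(rows)


# Made-up illustration data: (feature, label) pairs.
real_train = [(0.1, 0), (0.3, 0), (0.8, 1), (0.9, 1)]
real_test = [(0.2, 0), (0.4, 0), (0.7, 1), (1.0, 1)]
synthetic = [(0.0, 0), (0.2, 0), (0.6, 1), (0.7, 1)]

# Hypothetical result layout: synthetic-to-real score plus real-to-real baseline.
scores = {
    "tstr": accuracy(fit_centroids(synthetic), real_test),
    "trtr": accuracy(fit_centroids(real_train), real_test),
}
print(scores)
```

A synthetic-to-real score close to the real-to-real baseline suggests the synthetic data preserves the signal the model needs. A large gap suggests it does not.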