MLE

class synthyverse.evaluation.utility.MLE(target_column='target', discrete_features=None, random_state=0, train_set='synthetic', model_name='xgboost', model_params=None, tune=False, tuning_trials=32)

Bases: object

Registry name: mle

Machine Learning Efficacy (MLE), computed with a configurable ML model.

Measures how well synthetic data can be used for downstream machine learning tasks compared to real data.

Parameters:
  • target_column (str) – Name of the target column. Default: “target”.

  • discrete_features (list) – List of discrete/categorical feature names. Default: None (no discrete features).

  • random_state (int) – Random seed for reproducibility. Default: 0.

  • train_set (str) – Which dataset to train on: “synthetic” for train-on-synthetic, test-on-real (TSTR), or “real” for train-on-real, test-on-synthetic (TRTS). The model is evaluated on the opposite set. Default: “synthetic”.

  • model_name (str) – Estimator family. Supported values include “xgboost”, “randomforest”, “decisiontree”, “linearregression”, and “svm”, as well as common aliases. All models except XGBoost are scikit-learn models. Default: “xgboost”.

  • model_params (dict) – Model parameters passed to the selected estimator. Default: None.

  • tune (bool) – Whether to tune hyperparameters. Default: False.

  • tuning_trials (int) – Number of Optuna trials for hyperparameter tuning. Default: 32.
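
The train/test direction selected by train_set can be summarized with a small lookup. This is only an illustrative sketch: resolve_direction is a hypothetical helper, not part of synthyverse, and it encodes nothing beyond the rule stated above that evaluation happens on the opposite set.

```python
def resolve_direction(train_set: str) -> tuple:
    """Return (train_set, test_set) for a given train_set value.

    Hypothetical helper for illustration only; it is not part of synthyverse.
    """
    directions = {
        "synthetic": ("synthetic", "real"),  # TSTR: train on synthetic, test on real
        "real": ("real", "synthetic"),       # TRTS: train on real, test on synthetic
    }
    if train_set not in directions:
        raise ValueError(f"train_set must be 'synthetic' or 'real', got {train_set!r}")
    return directions[train_set]


print(resolve_direction("synthetic"))
print(resolve_direction("real"))
```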

Example

>>> import pandas as pd
>>> from synthyverse.evaluation import MLE
>>>
>>> # Prepare data
>>> X_train = pd.DataFrame(...)
>>> X_test = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> X_val = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = MLE(
...     target_column="target",
...     discrete_features=discrete_features,
...     train_set="synthetic",
...     tune=True,
...     random_state=42
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_train, X_test, X_syn, X_val=X_val)

evaluate(X_train, X_test, X_syn, X_val=None, X_syn_test=None)

Evaluate synthetic data utility using machine learning efficacy.

Parameters:
  • X_train (DataFrame) – Real training data as a pandas DataFrame.

  • X_test (DataFrame) – Real test data as a pandas DataFrame.

  • X_syn (DataFrame) – Synthetic training data as a pandas DataFrame.

  • X_val (Optional[DataFrame]) – Optional validation data used when tune=True.

  • X_syn_test (Optional[DataFrame]) – Optional synthetic test data used when train_set=“real”.

Returns:

Dictionary with metric scores for the configured train/test direction. Keys have the form “mle.train_<train_set>_test_<test_set>.<score>”.

Return type:

dict
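
Because the keys follow a fixed pattern, they can be split back into their parts when collecting results. The sketch below is a minimal example under stated assumptions: split_mle_key and scores_by_name are hypothetical helpers, the score name “accuracy” and the value 0.5 are placeholders rather than real output, and the <train_set>/<test_set> parts are assumed to be the literal strings “synthetic” and “real”.

```python
import re

# Matches keys of the documented form "mle.train_<train_set>_test_<test_set>.<score>".
KEY_PATTERN = re.compile(
    r"^mle\.train_(?P<train>[^.]+?)_test_(?P<test>[^.]+)\.(?P<score>.+)$"
)


def split_mle_key(key: str) -> tuple:
    """Split a result key into (train_set, test_set, score). Hypothetical helper."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not an MLE result key: {key!r}")
    return match["train"], match["test"], match["score"]


def scores_by_name(results: dict) -> dict:
    """Map each score name to its value, dropping the shared key prefix."""
    return {split_mle_key(key)[2]: value for key, value in results.items()}


# Placeholder dictionary shaped like the documented return value;
# the score name and number are made up for illustration.
example_results = {"mle.train_synthetic_test_real.accuracy": 0.5}

print(split_mle_key("mle.train_synthetic_test_real.accuracy"))
print(scores_by_name(example_results))
```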