MLE
- class synthyverse.evaluation.utility.MLE(target_column='target', discrete_features=None, random_state=0, train_set='synthetic', model_name='xgboost', model_params=None, tune=False, tuning_trials=32)
Bases: object
Registry name: mle
Machine Learning Efficacy from configurable ML models.
Measures how well synthetic data can be used for downstream machine learning tasks compared to real data.
- Parameters:
target_column (str) – Name of the target column. Default: “target”.
discrete_features (list) – List of discrete/categorical feature names. Default: None (treated as an empty list).
random_state (int) – Random seed for reproducibility. Default: 0.
train_set (str) – Dataset to train on: “synthetic” for TSTR (train on synthetic, test on real) or “real” for TRTS (train on real, test on synthetic). The model is evaluated on the opposite set. Default: “synthetic”.
model_name (str) – Estimator family. Supported values include “xgboost”, “randomforest”, “decisiontree”, “linearregression”, and “svm”, including common aliases. Every model except for XGBoost is a scikit-learn model. Default: “xgboost”.
model_params (dict) – Model parameters passed to the selected estimator. Default: None.
tune (bool) – Whether to tune hyperparameters. Default: False.
tuning_trials (int) – Number of Optuna trials for hyperparameter tuning. Default: 32.
Example
>>> import pandas as pd
>>> from synthyverse.evaluation import MLE
>>>
>>> # Prepare data
>>> X_train = pd.DataFrame(...)
>>> X_test = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> X_val = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = MLE(
...     target_column="target",
...     discrete_features=discrete_features,
...     train_set="synthetic",
...     tune=True,
...     random_state=42
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_train, X_test, X_syn, X_val=X_val)
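To make the two train/test directions concrete, the following is a minimal, library-free sketch of the idea behind TSTR and TRTS. It is not the MLE implementation: a pure-Python 1-nearest-neighbour classifier stands in for the configured estimator, the helper names (predict_1nn, accuracy) are hypothetical, and the toy rows are made up purely for illustration.

```python
# Conceptual sketch only: a stand-in classifier, not synthyverse's MLE code.

def predict_1nn(train_rows, train_labels, x):
    """Return the label of the training row closest to x (squared Euclidean distance)."""
    nearest = min(
        range(len(train_rows)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_rows[i], x)),
    )
    return train_labels[nearest]


def accuracy(train_rows, train_labels, test_rows, test_labels):
    """Fit on the training rows (trivially, for 1-NN) and score on the test rows."""
    hits = sum(
        predict_1nn(train_rows, train_labels, x) == y
        for x, y in zip(test_rows, test_labels)
    )
    return hits / len(test_rows)


# Made-up toy data: two numeric features and a binary target.
syn_X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]
syn_y = [0, 1, 0, 1]
real_X = [[0.0, 0.0], [1.0, 1.0], [0.3, 0.2]]
real_y = [0, 1, 0]

# TSTR: train on synthetic rows, test on real rows (train_set="synthetic").
tstr = accuracy(syn_X, syn_y, real_X, real_y)

# TRTS: train on real rows, test on synthetic rows (train_set="real").
trts = accuracy(real_X, real_y, syn_X, syn_y)
```

In both directions the model is fitted on one source and scored on the other, so a high score suggests that the synthetic data preserves the feature-target relationship of the real data.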
- evaluate(X_train, X_test, X_syn, X_val=None, X_syn_test=None)
Evaluate synthetic data utility using machine learning efficacy.
- Parameters:
X_train (DataFrame) – Real training data as a pandas DataFrame.
X_test (DataFrame) – Real test data as a pandas DataFrame.
X_syn (DataFrame) – Synthetic training data as a pandas DataFrame.
X_val (Optional[DataFrame]) – Optional validation data used when tune=True.
X_syn_test (Optional[DataFrame]) – Optional synthetic test data used when train_set=“real”.
- Returns:
- Dictionary with metric scores for the configured train/test direction. Keys have the form “mle.train_<train_set>_test_<test_set>.<score>”.
- Return type:
dict
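Because the direction and score name are encoded in each key, it can be convenient to split keys back into their parts. The sketch below shows one way to do that with plain string handling. The helpers split_mle_key and expected_prefix are hypothetical (they are not part of synthyverse), the score name "accuracy" is a placeholder since the actual score names are not listed here, and the assumption that the set names appear in keys as "synthetic" and "real" follows from the train_set values rather than from documented output.

```python
def split_mle_key(key):
    """Split 'mle.train_<train_set>_test_<test_set>.<score>' into (train_set, test_set, score)."""
    parts = key.split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected key format: {key!r}")
    prefix, direction, score = parts
    if prefix != "mle" or not direction.startswith("train_") or "_test_" not in direction:
        raise ValueError(f"unexpected key format: {key!r}")
    train_set, test_set = direction[len("train_"):].split("_test_", 1)
    return train_set, test_set, score


def expected_prefix(train_set):
    """Build the key prefix for a direction, assuming evaluation on the opposite set."""
    test_set = "real" if train_set == "synthetic" else "synthetic"
    return f"mle.train_{train_set}_test_{test_set}"


# Placeholder key for illustration; "accuracy" is an assumed score name.
example_key = expected_prefix("synthetic") + ".accuracy"
train_set, test_set, score = split_mle_key(example_key)
```

With a real results dictionary, the same helper could be applied to each key, for example to group scores by direction before reporting them.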