Tabular Synthesis
- class synthyverse.benchmark.synthesis.TabularSynthesisBenchmark(generator='arf', generator_params=None, n_random_splits=1, n_inits=1, test_size=0.2, val_size=0.1, missing_imputation_method='drop', retain_missingness=False, constraints=None, workspace='workspace', random_state=0)
Bases: object
Benchmark for evaluating tabular synthetic data generators.
- Parameters:
generator (Union[str, TabularBaseGenerator]) – Generator identifier. Can be a synthyverse generator name or a custom generator instance.
generator_params (dict) – Dictionary of generator-specific parameters. Default: None (empty dict).
n_random_splits (int) – Number of random train/test splits to evaluate. Default: 1.
n_inits (int) – Number of generator training initializations per split. Default: 1.
test_size (float) – Proportion of data to use for testing (0.0 to 1.0). Default: 0.2.
val_size (float) – Proportion of data to use for validation (0.0 to 1.0). Set to 0.0 to disable the validation split. Note that val_size+test_size must be < 1.0. Default: 0.1.
missing_imputation_method (str) – Method for handling missing values. “drop” removes rows with missing values; the other options impute them: “random”, “mean”, “median”, “most_frequent”, “missforest”. Default: “drop”.
retain_missingness (bool) – Whether to retain missing values in generated datasets. Default: False.
constraints (Union[str, list]) – Constraint string, or list of constraint strings, that should hold in the generated data. The constraints must already hold in the training datasets. Default: None (empty list).
workspace (str) – Directory for storing intermediate files. Default: “workspace”.
random_state (int) – Random seed. Default: 0.
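The interaction between the split and repetition parameters can be sketched in plain Python. This is a rough illustration only: `plan_benchmark` is a hypothetical helper, not part of synthyverse, and the rounding of split sizes is an assumption, since the library may compute them differently. It checks the documented `val_size + test_size < 1.0` rule and counts one trained generator per (split, initialization) pair.

```python
def plan_benchmark(n_rows, n_random_splits=1, n_inits=1,
                   test_size=0.2, val_size=0.1):
    """Hypothetical helper (not part of synthyverse): validate the split
    proportions and estimate subset sizes and the number of training runs."""
    if not 0.0 <= test_size <= 1.0 or not 0.0 <= val_size <= 1.0:
        raise ValueError("test_size and val_size must lie in [0.0, 1.0]")
    if val_size + test_size >= 1.0:
        raise ValueError("val_size + test_size must be < 1.0")
    # Assumption: sizes are rounded to the nearest row; the library's
    # actual rounding rule is not stated in this reference.
    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)
    return {
        "n_train": n_rows - n_test - n_val,
        "n_val": n_val,
        "n_test": n_test,
        # n_inits training initializations for each random split
        "n_trained_models": n_random_splits * n_inits,
    }


plan = plan_benchmark(1000, n_random_splits=3, n_inits=3)
print(plan)
```

With the defaults for `test_size` and `val_size`, roughly 70% of the rows remain for training, and `n_random_splits=3` with `n_inits=3` implies nine training runs.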
Example
>>> import pandas as pd
>>> from synthyverse.benchmark import TabularSynthesisBenchmark
>>>
>>> # Load your data
>>> X = pd.read_csv("data.csv")
>>> discrete_columns = ["category_col"]
>>> target_column = "target"
>>>
>>> # Create benchmark
>>> benchmark = TabularSynthesisBenchmark(
...     generator="arf",
...     generator_params={"num_trees": 50},
...     n_random_splits=3,
...     n_inits=3
... )
>>>
>>> # Train and evaluate models
>>> trained_models = benchmark.train(X, target_column, discrete_columns)
>>> results = benchmark.eval(
...     X,
...     trained_models,
...     metrics=["classifier_test", "mle", "dcr"],
...     n_generated_datasets=1,
... )
>>>
>>> # Or, train and evaluate models in one step:
>>> results, trained_models = benchmark.train_and_eval(X, target_column, discrete_columns)
>>> results
- eval(X, trained_models, metrics=None, n_generated_datasets=1, max_eval_size=1000000000, result_format='frame')
Evaluate trained model objects.
- Parameters:
X (DataFrame) – Full dataset as a pandas DataFrame.
trained_models (Union[TabularBaseGenerator, dict]) – A single trained model or nested split/init dict returned by this benchmark’s train().
metrics (Union[list, dict, None]) – List or dictionary of metrics to evaluate. Defaults to [“classifier_test”, “mle”, “dcr”] when None.
n_generated_datasets (int) – Number of synthetic datasets to generate per initialization.
max_eval_size (int) – Maximum size of sampled train/test/validation subsets used for evaluation.
result_format (str) – Format of results (“frame” for DataFrame, “dict” for nested dict).
- Returns:
Benchmark results in the specified format.
- Return type:
pd.DataFrame or dict
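The defaulting and capping behaviour described for `metrics` and `max_eval_size` might look roughly like the following. This is a sketch under assumptions, not the library's implementation: `resolve_metrics` and `capped_eval_size` are hypothetical names, and the way synthyverse actually subsamples is not documented here.

```python
DEFAULT_METRICS = ["classifier_test", "mle", "dcr"]  # documented default


def resolve_metrics(metrics=None):
    """Hypothetical helper: fall back to the documented defaults when None."""
    if metrics is None:
        return list(DEFAULT_METRICS)  # copy, so callers cannot mutate the default
    if isinstance(metrics, (list, dict)):
        return metrics
    raise TypeError("metrics must be a list, a dict, or None")


def capped_eval_size(n_rows, max_eval_size=1_000_000_000):
    """Hypothetical helper: a subset is never sampled above max_eval_size."""
    return min(n_rows, max_eval_size)


print(resolve_metrics())
print(capped_eval_size(250_000, max_eval_size=50_000))
```

Since `n_generated_datasets` applies per initialization, an evaluation over 3 splits and 3 initializations with `n_generated_datasets=2` would involve 3 × 3 × 2 = 18 synthetic datasets.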
- train(X, target_column, discrete_columns)
Train the configured generator and return trained model objects.
- Parameters:
X (DataFrame) – Full dataset as a pandas DataFrame.
target_column (str) – Name of the target column.
discrete_columns (list) – List of discrete/categorical column names.
- Returns:
Trained model, or nested split/init dict of trained models.
- Return type:
TabularBaseGenerator or dict
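Because train() may return either a single model or a nested dict, downstream code usually needs to handle both shapes. The sketch below assumes the dict is keyed first by split and then by initialization; that layout, the key values, and the `iter_trained_models` helper are assumptions for illustration and are not confirmed by this reference. Plain strings stand in for trained generators.

```python
def iter_trained_models(trained_models):
    """Hypothetical helper: yield (split_key, init_key, model) triples.

    Assumes a {split: {init: model}} layout for the nested dict; a single
    model is yielded once with None for both keys.
    """
    if not isinstance(trained_models, dict):
        yield None, None, trained_models
        return
    for split_key, inits in trained_models.items():
        for init_key, model in inits.items():
            yield split_key, init_key, model


# Stand-in objects instead of real generators (2 splits x 2 inits).
nested = {0: {0: "model_a", 1: "model_b"}, 1: {0: "model_c", 1: "model_d"}}
for split_key, init_key, model in iter_trained_models(nested):
    print(split_key, init_key, model)
```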
- train_and_eval(X, target_column, discrete_columns, metrics=None, n_generated_datasets=1, max_eval_size=1000000000, result_format='frame')
Train and evaluate the generator.
This is a convenience wrapper for users who only need benchmark results and do not need to call train() and eval() separately.
- Parameters:
X (DataFrame) – Full dataset as a pandas DataFrame.
target_column (str) – Name of the target column.
discrete_columns (list) – List of discrete/categorical column names.
metrics (Union[list, dict, None]) – List or dictionary of metrics to evaluate. Defaults to [“classifier_test”, “mle”, “dcr”] when None.
n_generated_datasets (int) – Number of synthetic datasets to generate per initialization.
max_eval_size (int) – Maximum size of sampled train/test/validation subsets used for evaluation.
result_format (str) – Format of results (“frame” for DataFrame, “dict” for nested dict).
- Returns:
(results, trained_models), where results is in the requested result_format, and trained_models is the output from train().
- Return type:
tuple