Tabular Synthesis

class synthyverse.benchmark.synthesis.TabularSynthesisBenchmark(generator='arf', generator_params=None, n_random_splits=1, n_inits=1, test_size=0.2, val_size=0.1, missing_imputation_method='drop', retain_missingness=False, constraints=None, workspace='workspace', random_state=0)

Bases: object

Benchmark for evaluating tabular synthetic data generators.

Parameters:
  • generator (Union[str, TabularBaseGenerator]) – Generator to benchmark: either the name of a synthyverse generator or a custom generator instance. Default: “arf”.

  • generator_params (dict) – Dictionary of generator-specific parameters. Default: None (treated as an empty dict).

  • n_random_splits (int) – Number of random train/test splits to evaluate. Default: 1.

  • n_inits (int) – Number of generator training initializations per split. Default: 1.

  • test_size (float) – Proportion of data to use for testing (0.0 to 1.0). Default: 0.2.

  • val_size (float) – Proportion of data to use for validation (0.0 to 1.0). Set to 0.0 to disable the validation split. The sum val_size + test_size must be less than 1.0. Default: 0.1.

  • missing_imputation_method (str) – Method for handling missing values. “drop” removes rows that contain missing values; the other options impute them: “random”, “mean”, “median”, “most_frequent”, “missforest”. Default: “drop”.

  • retain_missingness (bool) – Whether to retain missing values in generated datasets. Default: False.

  • constraints (Union[str, list]) – Constraint string, or list of constraint strings, that should hold in the generated data. The constraints should already hold in the training data. Default: None (treated as an empty list).

  • workspace (str) – Directory for storing intermediate files. Default: “workspace”.

  • random_state (int) – Seed for random number generation, used for reproducibility. Default: 0.
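
To make the interaction between the split and repetition parameters concrete, the sketch below estimates partition sizes and the number of generator fits for a given configuration. It is an illustration only: plan_benchmark is a hypothetical helper, not part of synthyverse, and the rounding of partition sizes is an assumption, so the row counts the library actually produces may differ slightly.

```python
def plan_benchmark(n_rows, test_size=0.2, val_size=0.1,
                   n_random_splits=1, n_inits=1):
    """Hypothetical helper (not a synthyverse function).

    Estimates partition sizes and the total number of generator fits.
    """
    if test_size < 0.0 or val_size < 0.0:
        raise ValueError("test_size and val_size must be non-negative")
    if val_size + test_size >= 1.0:
        raise ValueError("val_size + test_size must be < 1.0")

    # Assumption: sizes are rounded to the nearest row.
    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)
    n_train = n_rows - n_test - n_val

    return {
        "n_train": n_train,
        "n_val": n_val,
        "n_test": n_test,
        # The generator is trained n_inits times on each random split.
        "n_fits": n_random_splits * n_inits,
    }


plan = plan_benchmark(1000, n_random_splits=3, n_inits=3)
print(plan)
# {'n_train': 700, 'n_val': 100, 'n_test': 200, 'n_fits': 9}
```

With n_random_splits=3 and n_inits=3, as in the example that follows, the generator is trained nine times, so training cost grows with the product of the two parameters.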

Example

>>> import pandas as pd
>>> from synthyverse.benchmark import TabularSynthesisBenchmark
>>>
>>> # Load your data
>>> X = pd.read_csv("data.csv")
>>> discrete_columns = ["category_col"]
>>> target_column = "target"
>>>
>>> # Create benchmark
>>> benchmark = TabularSynthesisBenchmark(
...     generator="arf",
...     generator_params={"num_trees": 50},
...     n_random_splits=3,
...     n_inits=3
... )
>>>
>>> # Train and evaluate models
>>> trained_models = benchmark.train(X, target_column, discrete_columns)
>>> results = benchmark.eval(
...     X,
...     trained_models,
...     metrics=["classifier_test", "mle", "dcr"],
...     n_generated_datasets=1,
... )
>>> # Or, train and evaluate models in one step:
>>> results, trained_models = benchmark.train_and_eval(X, target_column, discrete_columns)
>>> results

eval(X, trained_models, metrics=None, n_generated_datasets=1, max_eval_size=1000000000, result_format='frame')

Evaluate trained model objects.

Parameters:
  • X (DataFrame) – Full dataset as a pandas DataFrame.

  • trained_models (Union[TabularBaseGenerator, dict]) – A single trained model, or the nested split/init dict returned by this benchmark’s train().

  • metrics (Union[list, dict, None]) – List or dictionary of metrics to evaluate. Defaults to [“classifier_test”, “mle”, “dcr”] when None.

  • n_generated_datasets (int) – Number of synthetic datasets to generate per initialization. Default: 1.

  • max_eval_size (int) – Maximum size of sampled train/test/validation subsets used for evaluation. Default: 1000000000.

  • result_format (str) – Format of results (“frame” for DataFrame, “dict” for nested dict). Default: “frame”.

Returns:

Benchmark results in the specified format.

Return type:

pd.DataFrame or dict
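
As a rough illustration of how max_eval_size and n_generated_datasets affect evaluation cost, the sketch below caps a subset at a maximum size and counts the synthetic datasets produced over all splits and initializations. Both helpers are hypothetical and not part of synthyverse; uniform sampling without replacement is an assumption about how the cap is applied.

```python
import random


def cap_rows(rows, max_eval_size, seed=0):
    """Hypothetical helper: keep at most max_eval_size rows.

    Assumes uniform sampling without replacement; synthyverse may differ.
    """
    rows = list(rows)
    if len(rows) <= max_eval_size:
        return rows  # already small enough, nothing to subsample
    return random.Random(seed).sample(rows, max_eval_size)


def n_synthetic_datasets(n_random_splits, n_inits, n_generated_datasets):
    """Total synthetic datasets: one batch per (split, init) pair."""
    return n_random_splits * n_inits * n_generated_datasets


test_rows = list(range(10))
capped = cap_rows(test_rows, max_eval_size=4)
total = n_synthetic_datasets(n_random_splits=3, n_inits=3,
                             n_generated_datasets=2)
print(len(capped), total)  # 4 18
```

The large default for max_eval_size means that subsets are effectively never reduced unless a smaller value is passed.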

train(X, target_column, discrete_columns)

Train the configured generator and return trained model objects.

Parameters:
  • X (DataFrame) – Full dataset as a pandas DataFrame.

  • target_column (str) – Name of the target column.

  • discrete_columns (list) – List of discrete/categorical column names.

Returns:

Trained model, or nested split/init dict of trained models.

Return type:

TabularBaseGenerator or dict
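
Because train() may return either a single model or a nested dict, downstream code that inspects the models may want to treat both cases uniformly. The sketch below walks either form. iter_models is a hypothetical helper, the key names in the demonstration are invented for illustration (the actual keys are not documented here), and it assumes that model objects are not themselves dicts.

```python
def iter_models(trained_models, path=()):
    """Yield (path, model) pairs from a single model or a nested dict.

    Hypothetical helper; assumes model objects are not dict instances.
    """
    if isinstance(trained_models, dict):
        for key, value in trained_models.items():
            yield from iter_models(value, path + (key,))
    else:
        yield path, trained_models


# Placeholder strings stand in for trained generator objects,
# and the key names below are invented for this demonstration.
nested = {
    "split_0": {"init_0": "model_a", "init_1": "model_b"},
    "split_1": {"init_0": "model_c", "init_1": "model_d"},
}

for path, model in iter_models(nested):
    print(path, model)

# A single model yields one pair with an empty path.
single = list(iter_models("model_a"))
```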

train_and_eval(X, target_column, discrete_columns, metrics=None, n_generated_datasets=1, max_eval_size=1000000000, result_format='frame')

Train and evaluate the generator.

This is a convenience wrapper for users who only need benchmark results and do not need to call train() and eval() separately.

Parameters:
  • X (DataFrame) – Full dataset as a pandas DataFrame.

  • target_column (str) – Name of the target column.

  • discrete_columns (list) – List of discrete/categorical column names.

  • metrics (Union[list, dict, None]) – List or dictionary of metrics to evaluate. Defaults to [“classifier_test”, “mle”, “dcr”] when None.

  • n_generated_datasets (int) – Number of synthetic datasets to generate per initialization. Default: 1.

  • max_eval_size (int) – Maximum size of sampled train/test/validation subsets used for evaluation. Default: 1000000000.

  • result_format (str) – Format of results (“frame” for DataFrame, “dict” for nested dict). Default: “frame”.

Returns:

(results, trained_models), where results is in the requested result_format, and trained_models is the output from train().

Return type:

tuple