Tabular Synthesis

class synthyverse.benchmark.synthesis.TabularSynthesisBenchmark(X, save_dir, generator, generator_params, categorical_features, target_column, random_state=42, constraints=None, missing_imputation_method='drop', monitor_memory=False, reuse_schema=True, reuse_processors=True, max_eval_samples=None)

Bases: object

Benchmark a tabular synthetic data generator on a real dataset.

TabularSynthesisBenchmark manages the complete workflow for comparing a generator against real tabular data. It creates reproducible train, validation, and test splits, preprocesses the data, trains or loads the configured generator, samples synthetic datasets, and evaluates metrics.

The benchmark can be run in two ways:

  • call run() to train and evaluate in one step;

  • call train() and eval() separately to reuse saved models or run different metrics later.

For each train seed, the benchmark stores the fitted data processor and the trained generator under save_dir. Dataset-level schema information is stored once and can be reused across runs. Results are returned as a long-format pandas.DataFrame and can also be written to CSV.
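A minimal sketch of what long-format results look like and how they might be summarized, using plain Python records instead of a DataFrame. The column names (metric, value, train_seed, set) and all numbers are placeholders chosen for illustration; they are assumptions, not the library's confirmed schema or real benchmark output.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format rows: one row per (metric, train_seed, set).
# Column names and values are placeholders, not real benchmark output.
rows = [
    {"metric": "wasserstein", "value": 0.10, "train_seed": 42, "set": 0},
    {"metric": "wasserstein", "value": 0.14, "train_seed": 43, "set": 0},
    {"metric": "dcr", "value": 0.50, "train_seed": 42, "set": 0},
    {"metric": "dcr", "value": 0.70, "train_seed": 43, "set": 0},
]


def summarize(rows):
    """Average each metric over all train seeds and synthetic sets."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["metric"]].append(row["value"])
    return {name: mean(values) for name, values in grouped.items()}


summary = summarize(rows)
```

On a real result frame the same aggregation would typically be a pandas groupby on the metric column; inspect the returned columns first, since the names used here are assumed.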

Parameters:
  • X (pd.DataFrame) – Real tabular dataset to benchmark on.

  • save_dir (str or Path) – Directory used for benchmark artifacts. Models are saved under save_dir/models and preprocessing artifacts are saved under save_dir/processors.

  • generator (str) – Name of the generator to benchmark. The name is resolved with get_generator.

  • generator_params (dict) – Keyword arguments used to initialize the generator.

  • categorical_features (list) – Names of categorical/discrete columns in the original dataset.

  • target_column (str or None) – Column used for stratified splits and supervised metrics. Use None when there is no target column.

  • random_state (int) – First seed used for train splits and synthetic set sampling. Additional train seeds are consecutive integers starting from this value. Default: 42.

  • constraints (list or str, optional) – Optional data constraints passed to DataProcessor. Default: None.

  • missing_imputation_method (str) – Missing-value strategy passed to DataProcessor. Options include "drop", "keep", "mean", "median", "most_frequent", and "missforest". Default: "drop".

  • monitor_memory (bool) – Whether to record peak CPU and, when available, CUDA memory usage during training and sampling. Default: False.

  • reuse_schema (bool) – Whether to reuse a saved dataset schema from save_dir when available. Default: True.

  • reuse_processors (bool) – Whether to reuse saved preprocessing artifacts for each train seed when available. Default: True.

  • max_eval_samples (int or None) – Default cap on the number of rows per real and synthetic dataset passed to evaluation metrics. For each synthetic set, the real train, validation, and test datasets are subsampled with the same per-set seed that is used to sample the synthetic data. Synthetic train and test datasets are generated directly at the capped size. Default: None.
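The seed rule described for random_state can be sketched as follows. The helper train_seeds is hypothetical and only illustrates the "consecutive integers starting from this value" behavior; it is not part of the library.

```python
def train_seeds(random_state: int, n_train_seeds: int) -> list[int]:
    """Return consecutive seeds starting at random_state (hypothetical helper)."""
    return [random_state + i for i in range(n_train_seeds)]


# With the default random_state and three train seeds:
seeds = train_seeds(random_state=42, n_train_seeds=3)  # [42, 43, 44]
```

Under this reading, a later eval() call with n_train_seeds=3 expects saved artifacts for the same three seeds.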

Example

>>> import pandas as pd
>>> from synthyverse.benchmark.synthesis import TabularSynthesisBenchmark
>>>
>>> X = pd.read_csv("data/cohort.csv")
>>> benchmark = TabularSynthesisBenchmark(
...     X=X,
...     save_dir="dataset/bn",
...     generator="bn",
...     generator_params={"struct_max_indegree": 2},
...     categorical_features=["sex", "mortality"],
...     target_column="mortality",
...     random_state=42,
... )
>>>
>>> benchmark.train(n_train_seeds=3)
>>> results = benchmark.eval(metrics=["wasserstein", "dcr"], n_train_seeds=3)

Train now and evaluate later by creating a new benchmark with the same dataset, generator settings, and save_dir:

>>> benchmark = TabularSynthesisBenchmark(
...     X=X,
...     save_dir="dataset/bn",
...     generator="bn",
...     generator_params={"struct_max_indegree": 2},
...     categorical_features=["sex", "mortality"],
...     target_column="mortality",
...     random_state=42,
... )
>>> benchmark.train(n_train_seeds=3)
>>>
>>> reloaded_benchmark = TabularSynthesisBenchmark(
...     X=X,
...     save_dir="dataset/bn",
...     generator="bn",
...     generator_params={"struct_max_indegree": 2},
...     categorical_features=["sex", "mortality"],
...     target_column="mortality",
...     random_state=42,
... )
>>> results = reloaded_benchmark.eval(
...     metrics=["wasserstein", "dcr"],
...     n_train_seeds=3,
... )

eval(metrics, n_train_seeds=1, n_sets=1, test_size=0.2, val_size=0.2, full_determinism=False, results_save_path='results', append_results=True, max_eval_samples=None)

Evaluate saved generators against real train and test data.

For each train seed, this method recreates the data split, loads the saved DataProcessor and generator, samples one or more synthetic train/test dataset pairs, postprocesses the samples back to the original data representation, and evaluates the requested metrics.

Call train() before using this method unless compatible saved models already exist in save_dir. The test_size and val_size values should match the values used for training.

Parameters:
  • metrics (dict or list) – Metrics passed to TabularMetricEvaluator.

  • n_train_seeds (int) – Number of consecutive train split seeds to evaluate. Matching saved models must exist for these seeds. Default: 1.

  • n_sets (int) – Number of synthetic datasets to sample per saved generator. Default: 1.

  • test_size (float) – Fraction of data reserved for testing. Default: 0.2.

  • val_size (float) – Fraction of data reserved for validation. This is interpreted as a fraction of the full dataset. Use 0 when the models were trained without validation data. Default: 0.2.

  • full_determinism (bool) – Whether to request stricter deterministic behavior from set_seed. Default: False.

  • results_save_path (str or Path) – Directory or CSV path for evaluation results. If a directory is provided, results are written to <results_save_path>/<generator>.csv. Default: "results".

  • append_results (bool) – Whether to append evaluation rows to an existing result CSV. Default: True.

  • max_eval_samples (int or None) – Optional per-call override for the benchmark’s default evaluation sample cap. Default: None.
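A rough sketch of how the per-call max_eval_samples override and seeded subsampling could interact. Both helpers are hypothetical, and the rule that a per-call value of None falls back to the benchmark default is an assumption drawn from the parameter description, not confirmed library behavior.

```python
import random


def effective_cap(default_cap, call_cap):
    """Assumed rule: the per-call value wins when given, else the default."""
    return call_cap if call_cap is not None else default_cap


def subsample(rows, cap, seed):
    """Return at most `cap` rows, drawn without replacement using `seed`."""
    if cap is None or len(rows) <= cap:
        return list(rows)
    return random.Random(seed).sample(rows, cap)


real_train = list(range(1000))  # stand-in for real training rows
cap = effective_cap(default_cap=500, call_cap=200)

# Reusing the seed of a given synthetic set reproduces the same real subset.
subset_a = subsample(real_train, cap, seed=42)
subset_b = subsample(real_train, cap, seed=42)
```

Tying the subsample to the per-set seed means each synthetic set is compared against a reproducible slice of the real data.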

Returns:

Evaluation result rows with sampling time, optional memory usage, and metric values.

Return type:

pd.DataFrame

Example

>>> results = benchmark.eval(
...     metrics=["wasserstein", "dcr"],
...     n_train_seeds=3,
...     n_sets=2,
... )

run(metrics, n_train_seeds=1, n_sets=1, test_size=0.2, val_size=0.2, full_determinism=False, results_save_path='results', max_eval_samples=None)

Train generators and evaluate them in one benchmark run.

This convenience method calls train() followed by eval() using the same split settings.

Parameters:
  • metrics (dict or list) – Metrics passed to TabularMetricEvaluator.

  • n_train_seeds (int) – Number of consecutive train split seeds to train and evaluate. Default: 1.

  • n_sets (int) – Number of synthetic datasets to sample per saved generator. Default: 1.

  • test_size (float) – Fraction of data reserved for testing. Default: 0.2.

  • val_size (float) – Fraction of data reserved for validation. This is interpreted as a fraction of the full dataset. Use 0 to skip validation data. Default: 0.2.

  • full_determinism (bool) – Whether to request stricter deterministic behavior from set_seed. Default: False.

  • results_save_path (str or Path) – Directory or CSV path for all benchmark results. If a directory is provided, results are written to <results_save_path>/<generator>.csv. Default: "results".

  • max_eval_samples (int or None) – Optional per-call override for the benchmark’s default evaluation sample cap. Default: None.
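The results_save_path rule (a directory resolves to <results_save_path>/<generator>.csv, while a CSV path is used as given) can be sketched with pathlib. The helper resolve_results_path is hypothetical, and treating any path that ends in ".csv" as a file is an assumption about how the two cases are told apart.

```python
from pathlib import Path


def resolve_results_path(results_save_path, generator):
    """Map a directory or CSV path to the CSV file that would be written.

    Assumption: a path ending in ".csv" is a file; anything else is a directory.
    """
    path = Path(results_save_path)
    if path.suffix.lower() == ".csv":
        return path
    return path / f"{generator}.csv"


default_target = resolve_results_path("results", "bn")            # results/bn.csv
explicit_target = resolve_results_path("runs/bn_eval.csv", "bn")  # runs/bn_eval.csv
```

Passing an explicit CSV path is one way to keep results from different configurations of the same generator in separate files.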

Returns:

Combined training and evaluation result rows.

Return type:

pd.DataFrame

Example

>>> metrics = ["wasserstein", "dcr"]
>>> results = benchmark.run(metrics=metrics, n_train_seeds=3, n_sets=2)

train(n_train_seeds=1, test_size=0.2, val_size=0.2, full_determinism=False, results_save_path='results', write_results=True, append_results=False)

Train generators for one or more reproducible data splits.

For each train seed, this method creates the train, validation, and test split, fits or loads the corresponding DataProcessor, trains the configured generator, and saves the trained generator under save_dir/models.

Use this method when you want to train models now and evaluate them later with eval(). The same test_size and val_size should be used during evaluation so that the saved artifacts match the recreated splits.

Parameters:
  • n_train_seeds (int) – Number of consecutive train split seeds to train. Default: 1.

  • test_size (float) – Fraction of data reserved for testing. Default: 0.2.

  • val_size (float) – Fraction of data reserved for validation. This is interpreted as a fraction of the full dataset. Use 0 to skip validation data. Default: 0.2.

  • full_determinism (bool) – Whether to request stricter deterministic behavior from set_seed. Default: False.

  • results_save_path (str or Path) – Directory or CSV path for training results. If a directory is provided, results are written to <results_save_path>/<generator>.csv. Default: "results".

  • write_results (bool) – Whether to write training results to CSV. Default: True.

  • append_results (bool) – Whether to append to an existing result CSV. Default: False.
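Because val_size is a fraction of the full dataset rather than of the rows left after the test split, the split sizes can be worked out directly. The sketch below is arithmetic only: the helper split_sizes is hypothetical, the rounding is an assumption, and stratification on target_column is ignored.

```python
def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (n_train, n_val, n_test), taking both fractions of n_rows."""
    if test_size + val_size >= 1:
        raise ValueError("test_size + val_size must be below 1")
    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)
    return n_rows - n_test - n_val, n_val, n_test


# Defaults on a 1000-row dataset: 600 train, 200 validation, 200 test rows.
n_train, n_val, n_test = split_sizes(1000)

# If validation rows were carved out of the non-test remainder in a second
# step, the equivalent fraction of that remainder would be val / (1 - test).
val_fraction_of_remainder = 0.2 / (1 - 0.2)
```

With the defaults, that second-step fraction is 0.25 of the remaining 800 rows, which yields the same 200 validation rows. Using different test_size or val_size values in eval() would recreate different splits than the ones the saved artifacts were fitted on.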

Returns:

Training result rows in long format, with columns for the metric name, the metric value, train_seed, and set.

Return type:

pd.DataFrame

Example

>>> train_results = benchmark.train(n_train_seeds=3)