Tabular Synthesis
- class synthyverse.benchmark.synthesis.TabularSynthesisBenchmark(X, save_dir, generator, generator_params, categorical_features, target_column, random_state=42, constraints=None, missing_imputation_method='drop', monitor_memory=False, reuse_schema=True, reuse_processors=True, max_eval_samples=None)
Bases: object
Benchmark a tabular synthetic data generator on a real dataset.
TabularSynthesisBenchmark manages the complete workflow for comparing a generator against real tabular data. It creates reproducible train, validation, and test splits, preprocesses the data, trains or loads the configured generator, samples synthetic datasets, and evaluates metrics.
The benchmark can be run in two ways:
- call run() to train and evaluate in one step;
- call train() and eval() separately to reuse saved models or run different metrics later.
For each train seed, the benchmark stores the fitted data processor and the trained generator under save_dir. Dataset-level schema information is stored once and can be reused across runs. Results are returned as a long-format pandas.DataFrame and can also be written to CSV.
- Parameters:
X (pd.DataFrame) – Real tabular dataset to benchmark on.
save_dir (str or Path) – Directory used for benchmark artifacts. Models are saved under save_dir/models and preprocessing artifacts are saved under save_dir/processors.
generator (str) – Name of the generator to benchmark. The name is resolved with get_generator.
generator_params (dict) – Keyword arguments used to initialize the generator.
categorical_features (list) – Names of categorical/discrete columns in the original dataset.
target_column (str or None) – Column used for stratified splits and supervised metrics. Use None when there is no target column.
random_state (int) – First seed used for train splits and synthetic set sampling. Additional train seeds are consecutive integers starting from this value. Default: 42.
constraints (list or str, optional) – Optional data constraints passed to DataProcessor. Default: None.
missing_imputation_method (str) – Missing-value strategy passed to DataProcessor. Options include "drop", "keep", "mean", "median", "most_frequent", and "missforest". Default: "drop".
monitor_memory (bool) – Whether to record peak CPU and, when available, CUDA memory usage during training and sampling. Default: False.
reuse_schema (bool) – Whether to reuse a saved dataset schema from save_dir when available. Default: True.
reuse_processors (bool) – Whether to reuse saved preprocessing artifacts for each train seed when available. Default: True.
max_eval_samples (int or None) – Default maximum number of rows per real and synthetic dataset passed to evaluation metrics. Real train, test, and validation datasets are subsampled per synthetic set with the same varying seed used for sampling synthetic data. Synthetic train and test datasets are generated directly at the capped size. Default: None.
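The seeded cap described for max_eval_samples can be pictured with a small standard-library sketch. It is an illustration only, not the library's implementation: the helper name subsample_rows and the use of random.Random are assumptions, and synthyverse may draw its subsample differently.

```python
import random


def subsample_rows(rows, max_eval_samples, seed):
    """Return at most max_eval_samples rows, drawn reproducibly from rows.

    Hypothetical helper: it illustrates a seeded cap, not synthyverse internals.
    """
    if max_eval_samples is None or len(rows) <= max_eval_samples:
        return list(rows)  # no cap configured, or nothing to trim
    rng = random.Random(seed)  # the same seed always yields the same subsample
    return rng.sample(list(rows), max_eval_samples)


real_train = list(range(1000))  # stand-in for the rows of a real dataset

capped_a = subsample_rows(real_train, max_eval_samples=100, seed=42)
capped_b = subsample_rows(real_train, max_eval_samples=100, seed=42)
capped_c = subsample_rows(real_train, max_eval_samples=100, seed=43)
uncapped = subsample_rows(real_train, max_eval_samples=None, seed=42)
```

Reusing a seed reproduces the subsample, while a different seed (for example, one per synthetic set) gives a different draw of the same size.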
Example
>>> import pandas as pd
>>> from synthyverse.benchmark.synthesis import TabularSynthesisBenchmark
>>>
>>> X = pd.read_csv("data/cohort.csv")
>>> benchmark = TabularSynthesisBenchmark(
...     X=X,
...     save_dir="dataset/bn",
...     generator="bn",
...     generator_params={"struct_max_indegree": 2},
...     categorical_features=["sex", "mortality"],
...     target_column="mortality",
...     random_state=42,
... )
>>>
>>> benchmark.train(n_train_seeds=3)
>>> results = benchmark.eval(metrics=["wasserstein", "dcr"], n_train_seeds=3)
Train now and evaluate later by creating a new benchmark with the same dataset, generator settings, and save_dir:
>>> benchmark = TabularSynthesisBenchmark(
...     X=X,
...     save_dir="dataset/bn",
...     generator="bn",
...     generator_params={"struct_max_indegree": 2},
...     categorical_features=["sex", "mortality"],
...     target_column="mortality",
...     random_state=42,
... )
>>> benchmark.train(n_train_seeds=3)
>>>
>>> reloaded_benchmark = TabularSynthesisBenchmark(
...     X=X,
...     save_dir="dataset/bn",
...     generator="bn",
...     generator_params={"struct_max_indegree": 2},
...     categorical_features=["sex", "mortality"],
...     target_column="mortality",
...     random_state=42,
... )
>>> results = reloaded_benchmark.eval(
...     metrics=["wasserstein", "dcr"],
...     n_train_seeds=3,
... )
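Both examples combine random_state=42 with n_train_seeds=3. According to the random_state description, train seeds are consecutive integers starting from that value, so the seeds involved can be listed with a short sketch. The helper name train_seeds is hypothetical and is not part of synthyverse.

```python
def train_seeds(random_state=42, n_train_seeds=1):
    """List the consecutive train seeds implied by the two settings."""
    return [random_state + offset for offset in range(n_train_seeds)]


seeds = train_seeds(random_state=42, n_train_seeds=3)
print(seeds)  # [42, 43, 44]
```

This is why a later eval() call needs the same random_state and an n_train_seeds no larger than the number of seeds that were trained: otherwise it would look for saved models that do not exist.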
- eval(metrics, n_train_seeds=1, n_sets=1, test_size=0.2, val_size=0.2, full_determinism=False, results_save_path='results', append_results=True, max_eval_samples=None)
Evaluate saved generators against real train and test data.
For each train seed, this method recreates the data split, loads the saved DataProcessor and generator, samples one or more synthetic train/test dataset pairs, postprocesses the samples back to the original data representation, and evaluates the requested metrics.
Call train() before using this method unless compatible saved models already exist in save_dir. The test_size and val_size values should match the values used for training.
- Parameters:
metrics (dict or list) – Metrics passed to TabularMetricEvaluator.
n_train_seeds (int) – Number of consecutive train split seeds to evaluate. Matching saved models must exist for these seeds. Default: 1.
n_sets (int) – Number of synthetic datasets to sample per saved generator. Default: 1.
test_size (float) – Fraction of data reserved for testing. Default: 0.2.
val_size (float) – Fraction of data reserved for validation. This is interpreted as a fraction of the full dataset. Use 0 when the models were trained without validation data. Default: 0.2.
full_determinism (bool) – Whether to request stricter deterministic behavior from set_seed. Default: False.
results_save_path (str or Path) – Directory or CSV path for evaluation results. If a directory is provided, results are written to <results_save_path>/<generator>.csv. Default: "results".
append_results (bool) – Whether to append evaluation rows to an existing result CSV. Default: True.
max_eval_samples (int or None) – Optional per-call override for the benchmark’s default evaluation sample cap. Default: None.
- Returns:
Evaluation result rows with sampling time, optional memory usage, and metric values.
- Return type:
pd.DataFrame
Example
>>> results = benchmark.eval(
...     metrics=["wasserstein", "dcr"],
...     n_train_seeds=3,
...     n_sets=2,
... )
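Because results come back in long format (one row per metric value), a common next step is to summarize each metric across train seeds and synthetic sets. The sketch below uses plain dictionaries and made-up numbers so that it runs without the library. The column names are the ones documented for train(), and it is an assumption that eval() rows use the same names.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up rows in long format. Column names follow the train() documentation
# and are assumed (not confirmed) to match what eval() returns.
rows = [
    {"metric name": "wasserstein", "metric value": 0.10, "train_seed": 42, "set": 0},
    {"metric name": "wasserstein", "metric value": 0.12, "train_seed": 43, "set": 0},
    {"metric name": "wasserstein", "metric value": 0.14, "train_seed": 44, "set": 0},
    {"metric name": "dcr", "metric value": 0.50, "train_seed": 42, "set": 0},
    {"metric name": "dcr", "metric value": 0.54, "train_seed": 43, "set": 0},
    {"metric name": "dcr", "metric value": 0.52, "train_seed": 44, "set": 0},
]

# Collect every value observed for each metric.
values_by_metric = defaultdict(list)
for row in rows:
    values_by_metric[row["metric name"]].append(row["metric value"])

# Mean and sample standard deviation per metric.
summary = {
    name: {"mean": mean(values), "std": stdev(values), "n": len(values)}
    for name, values in values_by_metric.items()
}
```

On the returned pandas.DataFrame, the same summary would be a groupby on the metric-name column followed by an aggregation of the metric-value column.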
- run(metrics, n_train_seeds=1, n_sets=1, test_size=0.2, val_size=0.2, full_determinism=False, results_save_path='results', max_eval_samples=None)
Train generators and evaluate them in one benchmark run.
This convenience method calls train() followed by eval() using the same split settings.
- Parameters:
metrics (dict or list) – Metrics passed to TabularMetricEvaluator.
n_train_seeds (int) – Number of consecutive train split seeds to train and evaluate. Default: 1.
n_sets (int) – Number of synthetic datasets to sample per saved generator. Default: 1.
test_size (float) – Fraction of data reserved for testing. Default: 0.2.
val_size (float) – Fraction of data reserved for validation. This is interpreted as a fraction of the full dataset. Use 0 to skip validation data. Default: 0.2.
full_determinism (bool) – Whether to request stricter deterministic behavior from set_seed. Default: False.
results_save_path (str or Path) – Directory or CSV path for all benchmark results. If a directory is provided, results are written to <results_save_path>/<generator>.csv. Default: "results".
max_eval_samples (int or None) – Optional per-call override for the benchmark’s default evaluation sample cap. Default: None.
- Returns:
Combined training and evaluation result rows.
- Return type:
pd.DataFrame
Example
>>> metrics = ["wasserstein", "dcr"]
>>> results = benchmark.run(metrics=metrics, n_train_seeds=3, n_sets=2)
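The results_save_path rule (a directory resolves to <results_save_path>/<generator>.csv, while a CSV path is used directly) can be sketched with pathlib. How synthyverse actually tells a directory from a file path is not stated here. The suffix check below is an assumption, and resolve_results_path is a hypothetical helper.

```python
from pathlib import Path


def resolve_results_path(results_save_path, generator):
    """Map a directory or CSV path to the CSV file that would be written.

    Hypothetical helper: a ".csv" suffix is assumed to mark a file path.
    """
    path = Path(results_save_path)
    if path.suffix.lower() == ".csv":
        return path  # an explicit CSV path is used as given
    return path / f"{generator}.csv"  # a directory gets <generator>.csv


default_target = resolve_results_path("results", "bn")
explicit_target = resolve_results_path("runs/bn_results.csv", "bn")
```

With the default results_save_path and generator="bn", the target is bn.csv inside the results directory, so runs of different generators that share a directory write to separate files.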
- train(n_train_seeds=1, test_size=0.2, val_size=0.2, full_determinism=False, results_save_path='results', write_results=True, append_results=False)
Train generators for one or more reproducible data splits.
For each train seed, this method creates the train, validation, and test split, fits or loads the corresponding DataProcessor, trains the configured generator, and saves the trained generator under save_dir/models.
Use this method when you want to train models now and evaluate them later with eval(). The same test_size and val_size should be used during evaluation so that the saved artifacts match the recreated splits.
- Parameters:
n_train_seeds (int) – Number of consecutive train split seeds to train. Default: 1.
test_size (float) – Fraction of data reserved for testing. Default: 0.2.
val_size (float) – Fraction of data reserved for validation. This is interpreted as a fraction of the full dataset. Use 0 to skip validation data. Default: 0.2.
full_determinism (bool) – Whether to request stricter deterministic behavior from set_seed. Default: False.
results_save_path (str or Path) – Directory or CSV path for training results. If a directory is provided, results are written to <results_save_path>/<generator>.csv. Default: "results".
write_results (bool) – Whether to write training results to CSV. Default: True.
append_results (bool) – Whether to append to an existing result CSV. Default: False.
- Returns:
Training result rows with columns metric name, metric value, train_seed, and set.
- Return type:
pd.DataFrame
Example
>>> train_results = benchmark.train(n_train_seeds=3)
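Since val_size is interpreted as a fraction of the full dataset rather than of what remains after the test split, the default settings leave roughly 60% of the rows for training. The arithmetic can be sketched as follows. The helper names are hypothetical and the rounding is an assumption, so the library's exact row counts may differ by a row.

```python
def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Approximate row counts when both fractions refer to the full dataset."""
    if test_size + val_size >= 1:
        raise ValueError("test_size + val_size must leave rows for training")
    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)
    n_train = n_rows - n_test - n_val  # whatever is left is used for training
    return {"train": n_train, "val": n_val, "test": n_test}


def relative_val_fraction(test_size=0.2, val_size=0.2):
    """Validation share of the rows that remain after removing the test split."""
    return val_size / (1 - test_size)


sizes = split_sizes(1000)               # defaults: 20% test, 20% validation
no_val = split_sizes(1000, val_size=0)  # validation skipped
relative = relative_val_fraction()      # fraction for a two-step split
```

For 1000 rows the defaults give 600 training, 200 validation, and 200 test rows. In a two-step split that removes the test rows first, the validation step would take 25% of the remainder to reach the same 200 rows.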