In-Depth Usage¶

Synthyverse has two layers for tabular synthesis:

Low-level components give explicit control over preprocessing, generator fitting, metrics, and persistence.
High-level wrappers combine those pieces into shorter workflows for common use cases.

Most users should start with SynthyverseGenerator for a single generator workflow or TabularSynthesisBenchmark for repeatable train/evaluate runs. Use the low-level classes when you want to inspect intermediate data, reuse one processor across generators, or call metric classes directly.

Low-Level Preprocessing¶

Low-level generators expect data that is already suitable for the model. Shared tabular preprocessing is handled by DataProcessor.

DataProcessor can:

drop, keep, or impute missing numerical values;
apply equality and inequality constraints before training;
restore the original column order, dtypes, and numerical precision after generation.

import pandas as pd
from synthyverse.generators import DataProcessor

X = pd.read_csv("data.csv")
discrete_features = ["category", "target"]

processor = DataProcessor(
    constraints=["total=part_a+part_b", "age>=18"],
    missing_imputation_method="median",
    random_state=42,
)

X_model = processor.preprocess(X, discrete_features=discrete_features)

If you provide validation data, both datasets are transformed with the same fitted processor.

X_train_model, X_val_model = processor.preprocess(
    X_train,
    discrete_features=discrete_features,
    X_val=X_val,
)

After a generator produces model-space data, call postprocess() to return to the original schema.

X_syn_model = generator.generate(1000)
X_syn = processor.postprocess(X_syn_model)

Missing Values¶

The current tabular imputation methods are:

drop: remove rows with missing numerical values.
keep: leave missing values unchanged for generators that can handle them.
mean: fill numerical missing values with the mean.
median: fill numerical missing values with the median.
most_frequent: fill numerical missing values with the most frequent value.
missforest: use iterative imputation with a random forest regressor.

Missing categorical values are left in categorical columns and can be handled by generator-specific categorical processing.

Constraints¶

Constraints are strings evaluated against the tabular columns.

Equality constraints remove one side of the equation during preprocessing and recompute it during postprocessing.

processor = DataProcessor(constraints=["total=part_a+part_b"])

Inequality constraints store a difference during preprocessing and reconstruct the constrained column during postprocessing.

processor = DataProcessor(constraints=["age>=18", "income>expenses"])

Constraints should already hold in the training data. Imputation can change constrained columns before the constraint transform runs, so constraints are most reliable when the constrained columns are complete or when the chosen imputation preserves the intended relationship.

Low-Level Generators¶

Every public generator inherits from BaseGenerator and follows the same public interface:

from synthyverse.generators import CTGANGenerator

generator = CTGANGenerator(epochs=300, batch_size=500, random_state=42)
generator.fit(X_model, discrete_features=discrete_features)
X_syn_model = generator.generate(1000)

Constructor arguments are generator-specific model hyperparameters. Shared preprocessing arguments such as constraints and missing_imputation_method belong on DataProcessor or SynthyverseGenerator, not on the low-level generator classes.

You can resolve generators by registry name when building configurable workflows.

from synthyverse.generators import get_generator

Generator = get_generator("ctgan")
generator = Generator(epochs=100, random_state=42)

Common registry names include arf, bn, ctgan, tvae, tabsyn, cdtd, tabargn, tabddpm, univariate, and smote.

Custom Modular Workflows¶

The modular setup is useful when you want to control the full workflow yourself instead of letting SynthyverseGenerator do every step. The core pattern is:

Fit a DataProcessor on real data.
Train one or more low-level generators on the processed model-space data.
Generate model-space synthetic data.
Use the same fitted processor to restore the original schema.

import pandas as pd
from synthyverse.generators import CTGANGenerator, DataProcessor

X_train = pd.read_csv("train.csv")
X_val = pd.read_csv("validation.csv")
discrete_features = ["category", "target"]

processor = DataProcessor(
    constraints=["total=part_a+part_b", "age>=18"],
    missing_imputation_method="median",
    random_state=42,
)

X_train_model, X_val_model = processor.preprocess(
    X_train,
    discrete_features=discrete_features,
    X_val=X_val,
)

generator = CTGANGenerator(epochs=300, batch_size=500, random_state=42)
generator.fit(
    X_train_model,
    discrete_features=discrete_features,
    X_val=X_val_model,
)

X_syn_model = generator.generate(1000)
X_syn = processor.postprocess(X_syn_model)

In this workflow, the generator only sees the processed model-space columns. The fitted processor owns the dataset-level contract: missing-value handling, constraint transforms, column order, dtypes, and numeric precision. That is why the same processor must be used for both preprocessing related real datasets and postprocessing generated data.

Because DataProcessor.preprocess() fits only on the first call, you can reuse a fitted processor to transform later datasets with the same original schema. This is useful when you want synthetic train and test samples in the same schema, or when you want to score a held-out real dataset in model space.

X_test_model = processor.preprocess(X_test)
X_syn_test = processor.postprocess(generator.generate(len(X_test)))

You can also reuse one processed dataset across multiple generators. This keeps preprocessing fixed while you compare model behavior.

from synthyverse.generators import CTGANGenerator, TVAEGenerator

generators = {
    "ctgan": CTGANGenerator(epochs=300, random_state=42),
    "tvae": TVAEGenerator(epochs=300, random_state=42),
}

synthetic_sets = {}
for name, generator in generators.items():
    generator.fit(
        X_train_model,
        discrete_features=discrete_features,
        X_val=X_val_model,
    )
    synthetic_sets[name] = processor.postprocess(generator.generate(1000))

For configurable experiments, combine the registry with the same modular pattern.

from synthyverse.generators import get_generator

generator_configs = {
    "ctgan": {"epochs": 300, "batch_size": 500},
    "tvae": {"epochs": 300},
}

synthetic_sets = {}
for generator_name, params in generator_configs.items():
    Generator = get_generator(generator_name)
    generator = Generator(**params, random_state=42)
    generator.fit(X_train_model, discrete_features=discrete_features)
    X_syn_model = generator.generate(1000)
    synthetic_sets[generator_name] = processor.postprocess(X_syn_model)

When you need persistence, save the processor and each low-level generator separately. Load both pieces later and keep the same order: generate with the low-level generator, then postprocess with the loaded processor.

processor.save("saved_models/shared_processor")
generator.save("saved_models/ctgan_low_level")

loaded_processor = DataProcessor.load("saved_models/shared_processor")
loaded_generator = CTGANGenerator.load("saved_models/ctgan_low_level")

X_syn_model = loaded_generator.generate(1000)
X_syn = loaded_processor.postprocess(X_syn_model)

High-Level Generator Wrapper¶

SynthyverseGenerator combines a DataProcessor with any low-level generator. It accepts a generator registry name, class, or instance.

from synthyverse.generators import SynthyverseGenerator

generator = SynthyverseGenerator(
    "ctgan",
    generator_params={"epochs": 300, "batch_size": 500},
    constraints=["total=part_a+part_b"],
    missing_imputation_method="median",
    random_state=42,
)

generator.fit(X, discrete_features=discrete_features)
X_syn = generator.generate(1000)

The wrapper preprocesses X, fits the low-level generator, samples model-space rows, and postprocesses those rows back into the original schema. The wrapped pieces remain available as generator.generator and generator.processor when you need lower-level control.

You can also pass a preconfigured processor.

from synthyverse.generators import DataProcessor, SynthyverseGenerator, TVAEGenerator

processor = DataProcessor(missing_imputation_method="most_frequent", random_state=42)
wrapper = SynthyverseGenerator(
    TVAEGenerator(epochs=100, random_state=42),
    processor=processor,
)

wrapper.fit(X, discrete_features=discrete_features)
X_syn = wrapper.sample(500)

Metrics and Evaluation¶

Metric classes can be used directly when you want full control.

from synthyverse.evaluation import Wasserstein, DCR, MLE

wasserstein = Wasserstein(discrete_features=discrete_features)
dcr = DCR(discrete_features=discrete_features)
mle = MLE(
    target_column="target",
    discrete_features=discrete_features,
    train_set="synthetic",
    random_state=42,
)

fidelity_results = wasserstein.evaluate(X_train=X_train, X_syn=X_syn)
privacy_results = dcr.evaluate(X_train=X_train, X_syn=X_syn)
utility_results = mle.evaluate(X_train=X_train, X_test=X_test, X_syn=X_syn, X_val=X_val)

Use TabularMetricEvaluator to run a group of metrics with consistent metadata. The evaluator is imported from the eval submodule.

from synthyverse.evaluation.eval import TabularMetricEvaluator

evaluator = TabularMetricEvaluator(
    metrics={
        "wasserstein": {},
        "dcr": {},
        "mle-tstr": {"train_set": "synthetic"},
        "mle-trts": {"train_set": "real"},
    },
    discrete_features=discrete_features,
    target_column="target",
    random_state=42,
)

results = evaluator.evaluate(
    X_train=X_train,
    X_test=X_test,
    X_syn=X_syn,
    X_syn_test=X_syn_test,
    X_val=X_val,
)

Metric registry names are resolved with get_metric(). The suffix after a dash is ignored for lookup, which lets you run several configurations of the same metric in one evaluator.

from synthyverse.evaluation import get_metric

Metric = get_metric("wasserstein")
metric = Metric(discrete_features=discrete_features)

Benchmarking¶

TabularSynthesisBenchmark is the highest-level workflow. It creates train/validation/test splits, fits or loads processors, trains or loads generators, samples synthetic datasets, evaluates metrics, and writes long-format result rows.

from synthyverse.benchmark.synthesis import TabularSynthesisBenchmark

benchmark = TabularSynthesisBenchmark(
    X=X,
    save_dir="runs/ctgan",
    generator="ctgan",
    generator_params={"epochs": 300, "batch_size": 500},
    categorical_features=discrete_features,
    target_column="target",
    constraints=["total=part_a+part_b"],
    missing_imputation_method="median",
    random_state=42,
)

results = benchmark.run(
    metrics=["wasserstein", "dcr"],
    n_train_seeds=3,
    n_sets=2,
)

Use train() and eval() separately when you want to reuse saved models or run new metrics later.

benchmark.train(n_train_seeds=3)
results = benchmark.eval(
    metrics={"mle-tstr": {"train_set": "synthetic"}},
    n_train_seeds=3,
    n_sets=1,
)

Saving and Loading¶

Low-level generators save their fitted state in a directory containing generator.pkl.

from synthyverse.generators import CTGANGenerator

generator = CTGANGenerator(epochs=300, random_state=42)
generator.fit(X_model, discrete_features)
generator.save("saved_models/ctgan_low_level")

loaded = CTGANGenerator.load("saved_models/ctgan_low_level")
X_syn_model = loaded.generate(1000)

DataProcessor saves to processor.pkl when given a directory.

from synthyverse.generators import DataProcessor

processor.save("saved_models/processor")
loaded_processor = DataProcessor.load("saved_models/processor")

SynthyverseGenerator saves the wrapper state, processor, and wrapped generator together. This is the easiest persistence option when you want to generate data in the original schema after loading.

generator = SynthyverseGenerator("ctgan", generator_params={"epochs": 300})
generator.fit(X, discrete_features=discrete_features)
generator.save("saved_models/ctgan_wrapper")

loaded = SynthyverseGenerator.load("saved_models/ctgan_wrapper")
X_syn = loaded.generate(1000)

Practical Guidance¶

Start with SynthyverseGenerator when you need one synthetic dataset and want preprocessing handled for you. Use DataProcessor plus a low-level generator when you need to inspect or reuse the model-space data. Use TabularMetricEvaluator for small metric suites, and TabularSynthesisBenchmark when you need reproducible splits, saved artifacts, multiple seeds, or repeated synthetic sets.