# In-Depth Usage

Synthyverse has two layers for tabular synthesis:

- **Low-level components** give explicit control over preprocessing, generator fitting, metrics, and persistence.
- **High-level wrappers** combine those pieces into shorter workflows for common use cases.

Most users should start with `SynthyverseGenerator` for a single generator workflow or `TabularSynthesisBenchmark` for repeatable train/evaluate runs. Use the low-level classes when you want to inspect intermediate data, reuse one processor across generators, or call metric classes directly.

## Low-Level Preprocessing

Low-level generators expect data that is already suitable for the model. Shared tabular preprocessing is handled by `DataProcessor`.

`DataProcessor` can:

- drop, keep, or impute missing numerical values;
- apply equality and inequality constraints before training;
- restore the original column order, dtypes, and numerical precision after generation.

```python
import pandas as pd

from synthyverse.generators import DataProcessor

X = pd.read_csv("data.csv")
discrete_features = ["category", "target"]

processor = DataProcessor(
    constraints=["total=part_a+part_b", "age>=18"],
    missing_imputation_method="median",
    random_state=42,
)
X_model = processor.preprocess(X, discrete_features=discrete_features)
```

If you provide validation data, both datasets are transformed with the same fitted processor.

```python
X_train_model, X_val_model = processor.preprocess(
    X_train,
    discrete_features=discrete_features,
    X_val=X_val,
)
```

After a generator produces model-space data, call `postprocess()` to return to the original schema.

```python
X_syn_model = generator.generate(1000)
X_syn = processor.postprocess(X_syn_model)
```

### Missing Values

The current tabular imputation methods are:

- `drop`: remove rows with missing numerical values.
- `keep`: leave missing values unchanged for generators that can handle them.
- `mean`: fill numerical missing values with the mean.
- `median`: fill numerical missing values with the median.
- `most_frequent`: fill numerical missing values with the most frequent value.
- `missforest`: use iterative imputation with a random forest regressor.

Missing categorical values are left in categorical columns and can be handled by generator-specific categorical processing.

### Constraints

Constraints are strings evaluated against the tabular columns. Equality constraints remove one side of the equation during preprocessing and recompute it during postprocessing.

```python
processor = DataProcessor(constraints=["total=part_a+part_b"])
```

Inequality constraints store a difference during preprocessing and reconstruct the constrained column during postprocessing.

```python
processor = DataProcessor(constraints=["age>=18", "income>expenses"])
```

Constraints should already hold in the training data. Imputation can change constrained columns before the constraint transform runs, so constraints are most reliable when the constrained columns are complete or when the chosen imputation preserves the intended relationship.

## Low-Level Generators

Every public generator inherits from `BaseGenerator` and follows the same public interface:

```python
from synthyverse.generators import CTGANGenerator

generator = CTGANGenerator(epochs=300, batch_size=500, random_state=42)
generator.fit(X_model, discrete_features=discrete_features)
X_syn_model = generator.generate(1000)
```

Constructor arguments are generator-specific model hyperparameters. Shared preprocessing arguments such as `constraints` and `missing_imputation_method` belong on `DataProcessor` or `SynthyverseGenerator`, not on the low-level generator classes.

You can resolve generators by registry name when building configurable workflows.

```python
from synthyverse.generators import get_generator

Generator = get_generator("ctgan")
generator = Generator(epochs=100, random_state=42)
```

Common registry names include `arf`, `bn`, `ctgan`, `tvae`, `tabsyn`, `cdtd`, `tabargn`, `tabddpm`, `univariate`, and `smote`.

## Custom Modular Workflows

The modular setup is useful when you want to control the full workflow yourself instead of letting `SynthyverseGenerator` do every step. The core pattern is:

1. Fit a `DataProcessor` on real data.
2. Train one or more low-level generators on the processed model-space data.
3. Generate model-space synthetic data.
4. Use the same fitted processor to restore the original schema.

```python
import pandas as pd

from synthyverse.generators import CTGANGenerator, DataProcessor

X_train = pd.read_csv("train.csv")
X_val = pd.read_csv("validation.csv")
discrete_features = ["category", "target"]

processor = DataProcessor(
    constraints=["total=part_a+part_b", "age>=18"],
    missing_imputation_method="median",
    random_state=42,
)
X_train_model, X_val_model = processor.preprocess(
    X_train,
    discrete_features=discrete_features,
    X_val=X_val,
)

generator = CTGANGenerator(epochs=300, batch_size=500, random_state=42)
generator.fit(
    X_train_model,
    discrete_features=discrete_features,
    X_val=X_val_model,
)

X_syn_model = generator.generate(1000)
X_syn = processor.postprocess(X_syn_model)
```

In this workflow, the generator only sees the processed model-space columns. The fitted processor owns the dataset-level contract: missing-value handling, constraint transforms, column order, dtypes, and numeric precision. That is why the same processor must be used for both preprocessing related real datasets and postprocessing generated data.

Because `DataProcessor.preprocess()` fits only on the first call, you can reuse a fitted processor to transform later datasets with the same original schema.
This is useful when you want synthetic train and test samples in the same schema, or when you want to score a held-out real dataset in model space.

```python
X_test_model = processor.preprocess(X_test)
X_syn_test = processor.postprocess(generator.generate(len(X_test)))
```

You can also reuse one processed dataset across multiple generators. This keeps preprocessing fixed while you compare model behavior.

```python
from synthyverse.generators import CTGANGenerator, TVAEGenerator

generators = {
    "ctgan": CTGANGenerator(epochs=300, random_state=42),
    "tvae": TVAEGenerator(epochs=300, random_state=42),
}

synthetic_sets = {}
for name, generator in generators.items():
    generator.fit(
        X_train_model,
        discrete_features=discrete_features,
        X_val=X_val_model,
    )
    synthetic_sets[name] = processor.postprocess(generator.generate(1000))
```

For configurable experiments, combine the registry with the same modular pattern.

```python
from synthyverse.generators import get_generator

generator_configs = {
    "ctgan": {"epochs": 300, "batch_size": 500},
    "tvae": {"epochs": 300},
}

synthetic_sets = {}
for generator_name, params in generator_configs.items():
    Generator = get_generator(generator_name)
    generator = Generator(**params, random_state=42)
    generator.fit(X_train_model, discrete_features=discrete_features)
    X_syn_model = generator.generate(1000)
    synthetic_sets[generator_name] = processor.postprocess(X_syn_model)
```

When you need persistence, save the processor and each low-level generator separately. Load both pieces later and keep the same order: generate with the low-level generator, then postprocess with the loaded processor.

```python
processor.save("saved_models/shared_processor")
generator.save("saved_models/ctgan_low_level")

loaded_processor = DataProcessor.load("saved_models/shared_processor")
loaded_generator = CTGANGenerator.load("saved_models/ctgan_low_level")

X_syn_model = loaded_generator.generate(1000)
X_syn = loaded_processor.postprocess(X_syn_model)
```

## High-Level Generator Wrapper

`SynthyverseGenerator` combines a `DataProcessor` with any low-level generator. It accepts a generator registry name, class, or instance.

```python
from synthyverse.generators import SynthyverseGenerator

generator = SynthyverseGenerator(
    "ctgan",
    generator_params={"epochs": 300, "batch_size": 500},
    constraints=["total=part_a+part_b"],
    missing_imputation_method="median",
    random_state=42,
)
generator.fit(X, discrete_features=discrete_features)
X_syn = generator.generate(1000)
```

The wrapper preprocesses `X`, fits the low-level generator, samples model-space rows, and postprocesses those rows back into the original schema. The wrapped pieces remain available as `generator.generator` and `generator.processor` when you need lower-level control.

You can also pass a preconfigured processor.

```python
from synthyverse.generators import DataProcessor, SynthyverseGenerator, TVAEGenerator

processor = DataProcessor(missing_imputation_method="most_frequent", random_state=42)
wrapper = SynthyverseGenerator(
    TVAEGenerator(epochs=100, random_state=42),
    processor=processor,
)
wrapper.fit(X, discrete_features=discrete_features)
X_syn = wrapper.generate(500)
```

## Metrics and Evaluation

Metric classes can be used directly when you want full control.

```python
from synthyverse.evaluation import Wasserstein, DCR, MLE

wasserstein = Wasserstein(discrete_features=discrete_features)
dcr = DCR(discrete_features=discrete_features)
mle = MLE(
    target_column="target",
    discrete_features=discrete_features,
    train_set="synthetic",
    random_state=42,
)

fidelity_results = wasserstein.evaluate(X_train=X_train, X_syn=X_syn)
privacy_results = dcr.evaluate(X_train=X_train, X_syn=X_syn)
utility_results = mle.evaluate(X_train=X_train, X_test=X_test, X_syn=X_syn, X_val=X_val)
```

Use `TabularMetricEvaluator` to run a group of metrics with consistent metadata. The evaluator is imported from the `eval` submodule.

```python
from synthyverse.evaluation.eval import TabularMetricEvaluator

evaluator = TabularMetricEvaluator(
    metrics={
        "wasserstein": {},
        "dcr": {},
        "mle-tstr": {"train_set": "synthetic"},
        "mle-trts": {"train_set": "real"},
    },
    discrete_features=discrete_features,
    target_column="target",
    random_state=42,
)
results = evaluator.evaluate(
    X_train=X_train,
    X_test=X_test,
    X_syn=X_syn,
    X_syn_test=X_syn_test,
    X_val=X_val,
)
```

Metric registry names are resolved with `get_metric()`. The suffix after a dash is ignored for lookup, which lets you run several configurations of the same metric in one evaluator.

```python
from synthyverse.evaluation import get_metric

Metric = get_metric("wasserstein")
metric = Metric(discrete_features=discrete_features)
```

## Benchmarking

`TabularSynthesisBenchmark` is the highest-level workflow. It creates train/validation/test splits, fits or loads processors, trains or loads generators, samples synthetic datasets, evaluates metrics, and writes long-format result rows.

```python
from synthyverse.benchmark.synthesis import TabularSynthesisBenchmark

benchmark = TabularSynthesisBenchmark(
    X=X,
    save_dir="runs/ctgan",
    generator="ctgan",
    generator_params={"epochs": 300, "batch_size": 500},
    categorical_features=discrete_features,
    target_column="target",
    constraints=["total=part_a+part_b"],
    missing_imputation_method="median",
    random_state=42,
)
results = benchmark.run(
    metrics=["wasserstein", "dcr"],
    n_train_seeds=3,
    n_sets=2,
)
```

Use `train()` and `eval()` separately when you want to reuse saved models or run new metrics later.

```python
benchmark.train(n_train_seeds=3)
results = benchmark.eval(
    metrics={"mle-tstr": {"train_set": "synthetic"}},
    n_train_seeds=3,
    n_sets=1,
)
```

## Saving and Loading

Low-level generators save their fitted state in a directory containing `generator.pkl`.

```python
from synthyverse.generators import CTGANGenerator

generator = CTGANGenerator(epochs=300, random_state=42)
generator.fit(X_model, discrete_features)
generator.save("saved_models/ctgan_low_level")

loaded = CTGANGenerator.load("saved_models/ctgan_low_level")
X_syn_model = loaded.generate(1000)
```

`DataProcessor` saves to `processor.pkl` when given a directory.

```python
from synthyverse.generators import DataProcessor

processor.save("saved_models/processor")
loaded_processor = DataProcessor.load("saved_models/processor")
```

`SynthyverseGenerator` saves the wrapper state, processor, and wrapped generator together. This is the easiest persistence option when you want to generate data in the original schema after loading.

```python
generator = SynthyverseGenerator("ctgan", generator_params={"epochs": 300})
generator.fit(X, discrete_features=discrete_features)
generator.save("saved_models/ctgan_wrapper")

loaded = SynthyverseGenerator.load("saved_models/ctgan_wrapper")
X_syn = loaded.generate(1000)
```

## Practical Guidance

Start with `SynthyverseGenerator` when you need one synthetic dataset and want preprocessing handled for you.
Use `DataProcessor` plus a low-level generator when you need to inspect or reuse the model-space data. Use `TabularMetricEvaluator` for small metric suites, and `TabularSynthesisBenchmark` when you need reproducible splits, saved artifacts, multiple seeds, or repeated synthetic sets.
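
When deciding whether to inspect model-space data yourself, it can help to picture what the constraint transforms from the Constraints section do to the columns. The sketch below is a rough illustration, not synthyverse's implementation: plain dicts of lists stand in for DataFrame columns, every helper name is hypothetical, and clipping negative differences to zero is an assumed way of enforcing the bound that the source does not specify.

```python
# Illustrative only: none of these helpers are part of synthyverse.
# A dict of lists stands in for the columns of a DataFrame.


def encode_equality(cols, target):
    """Drop `target`, which the equality constraint fully determines."""
    return {name: values for name, values in cols.items() if name != target}


def decode_equality(cols, target, parts):
    """Recompute `target` as the row-wise sum of the `parts` columns."""
    out = dict(cols)
    out[target] = [sum(row) for row in zip(*(cols[p] for p in parts))]
    return out


def encode_lower_bound(cols, column, bound):
    """Replace `column` with its difference from `bound`."""
    out = dict(cols)
    out[column] = [value - bound for value in cols[column]]
    return out


def decode_lower_bound(cols, column, bound):
    """Rebuild `column` from the stored difference.

    Assumption: negative generated differences are clipped to zero so the
    restored value never falls below `bound`.
    """
    out = dict(cols)
    out[column] = [bound + max(diff, 0) for diff in cols[column]]
    return out


real = {"part_a": [1, 2], "part_b": [3, 4], "total": [4, 6], "age": [18, 40]}

# Preprocessing: "total=part_a+part_b" drops total, "age>=18" stores age - 18.
model_space = encode_lower_bound(encode_equality(real, "total"), "age", 18)

# Stand-in for generator output in model space, including one invalid difference.
synthetic_model = {"part_a": [5, 0], "part_b": [2, 7], "age": [-3, 12]}

# Postprocessing: undo the transforms in reverse order.
restored = decode_equality(
    decode_lower_bound(synthetic_model, "age", 18),
    "total",
    ["part_a", "part_b"],
)
```

In this toy run the generator never sees `total`, so the restored rows satisfy the equality by construction, and the first synthetic row's `age` lands exactly on the bound. A real processor would also restore column order and dtypes, which this sketch leaves out.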