TabSyn

class synthyverse.generators.tabsyn_generator.TabSynGenerator(target_column, vae_lr=0.001, vae_wd=0, vae_d_token=4, vae_n_head=1, vae_factor=32, vae_num_layers=2, vae_batch_size=4096, vae_num_epochs=4000, vae_min_beta=1e-05, vae_max_beta=0.01, vae_lambda=0.7, diffusion_batch_size=4096, diffusion_num_epochs=10001, diffusion_dim_t=1024, diffusion_lr=0.001, diffusion_wd=0, diffusion_sampling_steps=50, diffusion_patience=500, num_workers=0, random_state=0)

Bases: BaseGenerator

Registry name: tabsyn

TabSyn: a latent diffusion model for tabular data.

Trains a VAE to learn a latent representation of tabular data, then fits a diffusion model in that latent space.

Paper: “Mixed-type tabular data synthesis with score-based diffusion in latent space” by Zhang et al. (2023). Based on the paper’s original implementation: https://github.com/amazon-science/tabsyn/

Parameters:
  • target_column (str) – Name of the target column used for stratified validation splitting.

  • vae_lr (float) – Learning rate for VAE training. Default: 1e-3.

  • vae_wd (float) – Weight decay used by the VAE optimizer. Default: 0.

  • vae_d_token (int) – Token embedding dimension used by the VAE. Default: 4.

  • vae_n_head (int) – Number of attention heads in the VAE transformer blocks. Default: 1.

  • vae_factor (int) – Expansion factor used in VAE feed-forward layers. Default: 32.

  • vae_num_layers (int) – Number of VAE encoder/decoder layers. Default: 2.

  • vae_batch_size (int) – Batch size used to train the VAE. Default: 4096.

  • vae_num_epochs (int) – Maximum number of VAE epochs. Default: 4000.

  • vae_min_beta (float) – Minimum KL coefficient for VAE KL annealing. Default: 1e-5.

  • vae_max_beta (float) – Initial/maximum KL coefficient for VAE KL annealing. Default: 1e-2.

  • vae_lambda (float) – Multiplicative decay factor applied to beta when validation plateaus. Default: 0.7.

  • diffusion_batch_size (int) – Batch size used to train the latent diffusion model. Default: 4096.

  • diffusion_num_epochs (int) – Maximum number of diffusion training epochs. Default: 10001.

  • diffusion_dim_t (int) – Time embedding dimension used by the diffusion denoiser. Default: 1024.

  • diffusion_lr (float) – Learning rate for diffusion training. Default: 1e-3.

  • diffusion_wd (float) – Weight decay used by the diffusion optimizer. Default: 0.

  • diffusion_sampling_steps (int) – Number of reverse diffusion steps used for generation. Default: 50.

  • diffusion_patience (int) – Early-stopping patience for diffusion training. Default: 500.

  • num_workers (int) – Number of worker processes for PyTorch data loaders. Higher values can speed up training but may cause issues when training locally on Windows. Default: 0.

  • random_state (int) – Random seed for reproducibility. Default: 0.
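
The three KL annealing parameters (vae_max_beta, vae_lambda, vae_min_beta) interact as a plateau-driven decay. The following is a minimal sketch of that behaviour in plain Python, not the library's implementation. The function name anneal_beta and the patience argument are assumptions made for illustration; the documentation above does not state how a plateau is detected.

```python
def anneal_beta(val_losses, max_beta=1e-2, min_beta=1e-5, decay=0.7, patience=10):
    """Return the KL coefficient (beta) in effect at each epoch.

    beta starts at max_beta, is multiplied by `decay` whenever the validation
    loss has not improved for `patience` consecutive epochs, and never drops
    below min_beta. `patience` is an assumed value, not a documented one.
    """
    beta = max_beta
    best = float("inf")
    stale = 0  # epochs since the last improvement
    history = []
    for loss in val_losses:
        history.append(beta)
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                # Plateau detected: shrink beta, clamped at the floor.
                beta = max(beta * decay, min_beta)
                stale = 0
    return history


# Validation loss improves for three epochs, then stays flat.
losses = [1.0, 0.8, 0.7] + [0.7] * 6
betas = anneal_beta(losses, patience=2)
```

With these inputs, beta stays at its initial value while the loss improves and then shrinks by the decay factor after every two flat epochs. A smaller vae_lambda lowers the KL weight faster once validation stops improving.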

Example

>>> import pandas as pd
>>> from synthyverse.generators import TabSynGenerator
>>>
>>> # Load data and define categorical columns
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["target", "category_col"]
>>>
>>> # Create generator (requires target column)
>>> generator = TabSynGenerator(
...     target_column="target",
...     vae_num_epochs=100,
...     diffusion_num_epochs=500,
...     random_state=42
... )
>>>
>>> # Fit and generate
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)

fit(X, discrete_features, X_val=None)

Fit the generator to tabular data.

Parameters:
  • X (DataFrame) – Training data in the generator’s input space.

  • discrete_features (list) – Names of categorical/discrete columns in X.

  • X_val (Optional[DataFrame]) – Optional validation data in the same schema as X.

Returns:

The fitted generator.
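
The target_column parameter is described as driving a stratified validation split, which presumably applies when X_val is not supplied. As a rough illustration of what stratification means here, the sketch below splits row indices so that each label keeps the same share in both parts. The helper stratified_split and its val_fraction argument are hypothetical and are not part of synthyverse.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split row indices into train/validation, preserving label proportions."""
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_label):
        rows = by_label[label]
        rng.shuffle(rows)
        # Take the same fraction of every label for validation.
        n_val = round(len(rows) * val_fraction)
        val_idx.extend(rows[:n_val])
        train_idx.extend(rows[n_val:])
    return sorted(train_idx), sorted(val_idx)


# An imbalanced target column: 80 rows of "a", 20 rows of "b".
labels = ["a"] * 80 + ["b"] * 20
train_idx, val_idx = stratified_split(labels, val_fraction=0.2, seed=0)
```

Here the validation part receives 16 rows of "a" and 4 rows of "b", matching the 80/20 ratio of the full data. A plain random split could leave a rare class out of the validation set entirely.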

generate(n)

Generate synthetic tabular data.

Parameters:
  • n (int) – Number of synthetic rows to generate.

Returns:

Synthetic data in the generator’s model space.

classmethod load(path)

Load a generator persisted with the default pickle layout.

save(path)

Persist the generator state with the default pickle layout.
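
The exact "default pickle layout" is not specified here. As a hedged sketch of the general save/load round trip, the stand-in class below pickles its attribute dictionary and restores it through a classmethod. ToyGenerator is hypothetical and only mirrors the save(path) and load(path) signatures documented above; the real on-disk format may differ.

```python
import os
import pickle
import tempfile


class ToyGenerator:
    """Hypothetical stand-in for a fitted generator; not part of synthyverse."""

    def __init__(self, random_state=0):
        self.random_state = random_state
        self.fitted_ = False

    def fit(self):
        self.fitted_ = True
        return self

    def save(self, path):
        # Assumed layout: a single pickle file holding the attribute dict.
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        obj = cls.__new__(cls)  # skip __init__, restore the saved state
        obj.__dict__.update(state)
        return obj


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "generator.pkl")
    ToyGenerator(random_state=42).fit().save(path)
    restored = ToyGenerator.load(path)
```

Because load is a classmethod, it is called on the class rather than on an instance. Unpickling can execute arbitrary code, so only load files from a trusted source.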