CDTD

class synthyverse.generators.cdtd_generator.CDTDGenerator(cat_emb_dim=16, mlp_emb_dim=256, mlp_n_layers=5, mlp_n_units=1024, sigma_data_cat=1.0, sigma_data_cont=1.0, sigma_min_cat=0.0, sigma_min_cont=0.0, sigma_max_cat=100.0, sigma_max_cont=80.0, cat_emb_init_sigma=0.001, timewarp_type='bytype', timewarp_weight_low_noise=1.0, num_steps_train=30000, num_steps_warmup=1000, batch_size=4096, lr=0.001, ema_decay=0.999, log_steps=100, random_state=0)

Bases: BaseGenerator

Registry name: cdtd

Continuous Diffusion for mixed-type Tabular Data (CDTD).

CDTD models continuous and categorical features with one continuous diffusion process. Categorical features are represented by learned embeddings (see cat_emb_dim), which puts both data types on a common footing during modelling.

This generator wraps the simplified implementation published by the paper’s authors (https://github.com/muellermarkus/cdtd_simple).

Paper: “Continuous Diffusion for Mixed-Type Tabular Data” by Mueller et al. (2023).

Parameters:
  • cat_emb_dim (int) – Embedding dimension for categorical features. Default: 16.

  • mlp_emb_dim (int) – Embedding dimension for MLP layers. Default: 256.

  • mlp_n_layers (int) – Number of MLP layers. Default: 5.

  • mlp_n_units (int) – Number of units per MLP layer. Default: 1024.

  • sigma_data_cat (float) – Data sigma for categorical features. Default: 1.0.

  • sigma_data_cont (float) – Data sigma for continuous features. Default: 1.0.

  • sigma_min_cat (float) – Minimum sigma for categorical features. Default: 0.0.

  • sigma_min_cont (float) – Minimum sigma for continuous features. Default: 0.0.

  • sigma_max_cat (float) – Maximum sigma for categorical features. Default: 100.0.

  • sigma_max_cont (float) – Maximum sigma for continuous features. Default: 80.0.

  • cat_emb_init_sigma (float) – Initial sigma for categorical embeddings. Default: 0.001.

  • timewarp_type (str) – Type of time warping. Options: “single”, “bytype”, “all”. Default: “bytype”.

  • timewarp_weight_low_noise (float) – Weight for low noise in time warping. Default: 1.0.

  • num_steps_train (int) – Number of training steps (iterations, not epochs). Default: 30000.

  • num_steps_warmup (int) – Number of warmup steps. Default: 1000.

  • batch_size (int) – Batch size for training. Default: 4096.

  • lr (float) – Learning rate. Default: 0.001.

  • ema_decay (float) – Exponential moving average decay. Default: 0.999.

  • log_steps (int) – Steps between logging. Default: 100.

  • random_state (int) – Random seed for reproducibility. Default: 0.
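Because num_steps_train counts iterations rather than epochs, it can help to translate a step budget into approximate passes over the data. The sketch below does that arithmetic. The helper name approx_epochs and the 50,000-row table are hypothetical, and the calculation assumes each step consumes one batch of at most batch_size rows, which is an assumption about the training loop rather than documented behaviour.

```python
def approx_epochs(num_steps_train: int, batch_size: int, n_rows: int) -> float:
    """Rough number of passes over the data implied by a step budget.

    Assumes one batch of at most `batch_size` rows per step (an assumption,
    not something the documentation above states).
    """
    effective_batch = min(batch_size, n_rows)
    return num_steps_train * effective_batch / n_rows


# Defaults from the class signature, applied to a hypothetical 50,000-row table.
epochs = approx_epochs(num_steps_train=30000, batch_size=4096, n_rows=50_000)
print(f"approximately {epochs:.1f} passes over the data")
```

For small tables the default step budget corresponds to a very large number of passes, so lowering num_steps_train or batch_size may be worth considering.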

Example

>>> import pandas as pd
>>> from synthyverse.generators import CDTDGenerator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["category_col"]
>>>
>>> # Create generator
>>> generator = CDTDGenerator(
...     timewarp_type="bytype",
...     num_steps_train=30000,
...     random_state=42
... )
>>>
>>> # Fit and generate
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)

fit(X, discrete_features, X_val=None)

Fit the generator to tabular data.

Parameters:
  • X (DataFrame) – Training data in the generator’s input space.

  • discrete_features (list) – Names of categorical/discrete columns in X.

  • X_val (Optional[DataFrame]) – Optional validation data in the same schema as X.

Returns:

The fitted generator.
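The optional X_val argument expects data in the same schema as X. One way to obtain it is to hold out a share of the rows before fitting. The sketch below uses only pandas; split_train_val is a hypothetical helper, not part of synthyverse, and it assumes the DataFrame index is unique. The fit call is shown as a comment so the block runs without the library installed.

```python
import pandas as pd


def split_train_val(X: pd.DataFrame, val_fraction: float = 0.1, seed: int = 0):
    """Hold out a random fraction of rows for validation (hypothetical helper).

    Assumes X has a unique index, so dropping by index removes exactly the
    sampled rows.
    """
    X_val = X.sample(frac=val_fraction, random_state=seed)
    X_train = X.drop(index=X_val.index)
    return X_train, X_val


# Toy table standing in for real data.
X = pd.DataFrame({"category_col": ["a", "b"] * 10, "value": range(20)})
discrete_features = ["category_col"]

X_train, X_val = split_train_val(X, val_fraction=0.2, seed=0)
print(len(X_train), len(X_val))

# With synthyverse installed, the split would be passed as documented above:
# generator = CDTDGenerator(random_state=0)
# generator.fit(X_train, discrete_features, X_val=X_val)
```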

generate(n)

Generate synthetic tabular data.

Parameters:
  • n (int) – Number of synthetic rows to generate.

Returns:

Synthetic data in the generator’s model space.

classmethod load(path)

Load a generator persisted with the default pickle layout.

save(path)

Persist the generator state with the default pickle layout.
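A save/load round trip could look like the sketch below. To keep it runnable without synthyverse, it uses PickleStub, a hypothetical stand-in that only mirrors the documented interface (an instance method save(path) and a classmethod load(path)). It does not reproduce the library's actual on-disk layout. The roundtrip helper is likewise an illustration.

```python
import os
import pickle
import tempfile


class PickleStub:
    """Hypothetical stand-in with the same save/load interface as documented."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def save(self, path):
        # Store only the instance attributes.
        with open(path, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        # Rebuild an instance without calling __init__, then restore attributes.
        obj = cls.__new__(cls)
        with open(path, "rb") as f:
            obj.__dict__.update(pickle.load(f))
        return obj


def roundtrip(generator, path):
    """Save a generator, then reload it through its own class."""
    generator.save(path)
    return type(generator).load(path)


with tempfile.TemporaryDirectory() as tmp:
    restored = roundtrip(PickleStub(random_state=42), os.path.join(tmp, "cdtd.pkl"))

print(restored.random_state)
```

With the real class, the same pattern would be generator.save(path) followed by CDTDGenerator.load(path). As with any pickle-based format, files should only be loaded from trusted sources.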