RealTabFormer

class synthyverse.generators.realtabformer_generator.RealTabFormerGenerator(workspace, epochs=1000, batch_size=8, mask_rate=0, early_stopping_patience=5, early_stopping_threshold=0, random_state=0, **kwargs)

Bases: TabularBaseGenerator

Realistic Relational and Tabular Data using Transformers.

Fine-tunes GPT-2 for tabular synthetic data generation.

Wraps the implementation provided by the realtabformer package on PyPI.

Paper: “REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers” by Solatorio and Dupriez (2023).

Parameters:
  • workspace (str) – Directory for storing checkpoints and samples.

  • epochs (int) – Number of training epochs. Default: 1000.

  • batch_size (int) – Batch size for training. Default: 8.

  • mask_rate (float) – Fraction of input tokens masked during training. Default: 0 (no masking).

  • early_stopping_patience (int) – Number of consecutive evaluations without improvement before training stops early. Default: 5.

  • early_stopping_threshold (float) – Minimum improvement in the monitored metric that counts as progress. Default: 0.

  • random_state (int) – Random seed for reproducibility. Default: 0.

  • **kwargs – Additional arguments passed to TabularBaseGenerator.
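The interaction between early_stopping_patience and early_stopping_threshold can be illustrated with a minimal, library-independent sketch. It assumes the common convention (used, for example, by Hugging Face's EarlyStoppingCallback) that an evaluation only counts as progress when the loss drops by more than the threshold. Whether this generator applies exactly this rule is an assumption, and the function name stopping_index is hypothetical.

```python
def stopping_index(losses, patience=5, threshold=0.0):
    """Return the 0-based index of the evaluation at which training
    would stop early, or None if it would run to completion.

    Illustrative only: assumes a lower-is-better metric and that an
    evaluation counts as progress when it beats the best loss so far
    by more than `threshold`.
    """
    best = float("inf")
    stale = 0  # consecutive evaluations without sufficient improvement
    for i, loss in enumerate(losses):
        if best - loss > threshold:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


# Hypothetical evaluation losses, one per evaluation round.
eval_losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.69, 0.7, 0.7]

stop_strict = stopping_index(eval_losses, patience=2, threshold=0.0)
stop_lenient = stopping_index(eval_losses, patience=3, threshold=0.0)
stop_thresholded = stopping_index(eval_losses, patience=3, threshold=0.05)
```

Under these assumptions, a larger patience tolerates longer plateaus, while a larger threshold treats small gains (such as the drop from 0.7 to 0.69 above) as no progress at all.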

Example

>>> import pandas as pd
>>> from synthyverse.generators import RealTabFormerGenerator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["category_col"]
>>>
>>> # Create generator (requires workspace)
>>> generator = RealTabFormerGenerator(
...     workspace="./realtabformer_workspace",
...     epochs=1000,
...     batch_size=8,
...     random_state=42
... )
>>>
>>> # Fit and generate
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)
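
After generation, a quick sanity check is to compare the category shares of a discrete column in the real and synthetic data. The sketch below is self-contained and uses small stand-in frames in place of X and X_syn. It assumes that generate() returns a pandas DataFrame with the same columns as the training data, which is not confirmed here. The helper compare_category_shares is hypothetical and not part of synthyverse.

```python
import pandas as pd


def compare_category_shares(real, synthetic, column):
    """Return side-by-side relative frequencies of one discrete column.

    Hypothetical helper for illustration; categories missing from one
    frame get a share of 0.0.
    """
    real_share = real[column].value_counts(normalize=True).rename("real")
    syn_share = synthetic[column].value_counts(normalize=True).rename("synthetic")
    return pd.concat([real_share, syn_share], axis=1).fillna(0.0)


# Stand-in data; in practice these would be X and X_syn from the example.
X_demo = pd.DataFrame({"category_col": ["a", "a", "b", "c"]})
X_syn_demo = pd.DataFrame({"category_col": ["a", "b", "b", "b"]})

report = compare_category_shares(X_demo, X_syn_demo, "category_col")
```

Large gaps between the two columns of the report, or categories that never appear in the synthetic data, suggest that the model may need more training or different settings.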