XGenBoost Diffusion

class synthyverse.generators.xgenboost_generator.XGB_Diffusion_Generator(target_column, timesteps=50, noise_samples_per_row=100, n_jobs=-1, n_jobs_xgb=1, beta_min=0.1, beta_max=8.0, eps=0.0, xgboost_params={'max_depth': 7, 'n_estimators': 100, 'reg_lambda': 0.0}, random_state=0, clip_extremes=True, sampler='ddpm', objective='x', dropout=0.1, dropout_token='mean', **kwargs)

Bases: TabularBaseGenerator

XGenBoost diffusion generator.

A denoising diffusion probabilistic model (DDPM) that uses XGBoost models as the score estimator.

Parameters:

target_column (str) – Name of the target column.
timesteps (int) – Number of diffusion timesteps. Default: 50.
noise_samples_per_row (int) – Number of noise levels per row. Default: 100.
n_jobs (int) – Number of parallel jobs used across timesteps/features. Default: -1.
n_jobs_xgb (int) – Number of threads used per XGBoost model. Default: 1.
beta_min (float) – Minimum beta value for the variance-preserving schedule. Default: 0.1.
beta_max (float) – Maximum beta value for the variance-preserving schedule. Default: 8.0.
eps (float) – Lower bound for the diffusion time grid. Default: 0.0.
xgboost_params (Optional[Dict[str, Any]]) – Base parameters for XGBoost regressors/classifiers. Default: {"n_estimators": 100, "max_depth": 7, "reg_lambda": 0.0}.
random_state (int) – Random seed for reproducibility. Default: 0.
clip_extremes (bool) – Whether to clip synthesized numerical values to observed training min/max. Default: True.
sampler (str) – Numerical reverse sampler. Options: "ddpm", "ddim". Default: "ddpm".
objective (str) – Numerical prediction objective. Options: "x", "v". Default: "x".
dropout (float) – Feature dropout probability applied to numerical inputs during training. Default: 0.1.
dropout_token (str) – Token used when dropping numerical inputs. Options: "mean", "missing", "random". Default: "mean".
**kwargs – Additional arguments passed to TabularBaseGenerator.
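
To make the roles of timesteps, beta_min, beta_max and eps more concrete, here is a minimal sketch of a variance-preserving schedule. It assumes the common linear form beta(t) = beta_min + t * (beta_max - beta_min) on a time grid from eps to 1; the helper name vp_schedule, the grid layout, and the exact formula are illustrative assumptions, not taken from the synthyverse source, so the library's internals may differ.

```python
import numpy as np


def vp_schedule(timesteps=50, beta_min=0.1, beta_max=8.0, eps=0.0):
    """Hypothetical helper: linear-beta variance-preserving schedule on [eps, 1]."""
    # Assumed time grid: `timesteps` evenly spaced points from eps to 1
    t = np.linspace(eps, 1.0, timesteps)
    # Integral of beta(s) = beta_min + s * (beta_max - beta_min) from 0 to t
    int_beta = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    signal = np.exp(-0.5 * int_beta)          # scale applied to the clean data x_0
    noise = np.sqrt(1.0 - np.exp(-int_beta))  # std of the added Gaussian noise
    return t, signal, noise


t, signal, noise = vp_schedule()

# Noise a small batch of standardized numerical features at grid index k
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 3))
k = 25
xt = signal[k] * x0 + noise[k] * rng.normal(size=x0.shape)
```

Under this form signal**2 + noise**2 equals 1 at every grid point, which is what "variance-preserving" refers to: standardized inputs keep unit variance as noise is added. A larger beta_max pushes the final timestep closer to pure Gaussian noise, and with eps=0.0 the first grid point carries no noise at all.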

Example

>>> import pandas as pd
>>> from synthyverse.generators import XGB_Diffusion_Generator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["target", "category_col"]
>>>
>>> # Create generator (requires target column)
>>> generator = XGB_Diffusion_Generator(
...     target_column="target",
...     timesteps=50,
...     sampler="ddpm",
...     random_state=42
... )
>>>
>>> # Fit and generate
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)
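
The effect of clip_extremes=True on numerical columns can be pictured with the sketch below. It uses plain NumPy on made-up stand-in arrays rather than the generator itself, and assumes the clipping is done per column against the training minimum and maximum, as the parameter description suggests; the variable names are illustrative only.

```python
import numpy as np

# Stand-in training data: two numerical columns (made-up values for illustration)
train = np.array([[18.0, 20.0],
                  [35.0, 55.0],
                  [62.0, 90.0]])

# Stand-in raw synthetic samples, some outside the training range
synthetic = np.array([[12.5, 25.0],
                      [40.0, -3.0],
                      [71.3, 120.0]])

# Per-column bounds observed in the training data
lower = train.min(axis=0)
upper = train.max(axis=0)

# Clip each column to its own [min, max] range; in-range values are unchanged
clipped = np.clip(synthetic, lower, upper)
```

Values outside the observed range are moved to the nearest bound. This keeps synthetic values plausible, at the cost of some probability mass piling up at the training extremes.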