XGenBoost Diffusion
- class synthyverse.generators.xgenboost_generator.XGB_Diffusion_Generator(target_column, timesteps=50, noise_samples_per_row=100, n_jobs=-1, n_jobs_xgb=1, beta_min=0.1, beta_max=8.0, eps=0.0, xgboost_params={'max_depth': 7, 'n_estimators': 100, 'reg_lambda': 0.0}, random_state=0, clip_extremes=True, sampler='ddpm', objective='x', dropout=0.1, dropout_token='mean', **kwargs)
Bases: TabularBaseGenerator
XGenBoost diffusion generator.
A denoising diffusion probabilistic model (DDPM) that uses XGBoost as the score estimator.
- Parameters:
target_column (str) – Name of the target column.
timesteps (int) – Number of diffusion timesteps. Default: 50.
noise_samples_per_row (int) – Number of noise levels per row. Default: 100.
n_jobs (int) – Number of parallel jobs used across timesteps/features. Default: -1.
n_jobs_xgb (int) – Number of threads used per XGBoost model. Default: 1.
beta_min (float) – Minimum beta value for the variance-preserving schedule. Default: 0.1.
beta_max (float) – Maximum beta value for the variance-preserving schedule. Default: 8.0.
eps (float) – Lower bound for the diffusion time grid. Default: 0.0.
xgboost_params (Optional[Dict[str, Any]]) – Base parameters for XGBoost regressors/classifiers. Default: {"n_estimators": 100, "max_depth": 7, "reg_lambda": 0.0}.
random_state (int) – Random seed for reproducibility. Default: 0.
clip_extremes (bool) – Whether to clip synthesized numerical values to observed training min/max. Default: True.
sampler (str) – Numerical reverse sampler. Options: "ddpm", "ddim". Default: "ddpm".
objective (str) – Numerical prediction objective. Options: "x", "v". Default: "x".
dropout (float) – Feature dropout probability applied to numerical inputs during training. Default: 0.1.
dropout_token (str) – Token used when dropping numerical inputs. Options: "mean", "missing", "random". Default: "mean".
**kwargs – Additional arguments passed to TabularBaseGenerator.
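To give a feel for how timesteps, eps, beta_min and beta_max interact, the sketch below computes a time grid and the signal/noise scales of a variance-preserving schedule. It assumes the common linear-beta form, beta(t) = beta_min + t * (beta_max - beta_min), and an evenly spaced grid from eps to 1. The helper names vp_time_grid and vp_alpha_sigma are hypothetical and are not part of synthyverse, whose internal schedule may differ.

```python
import math


def vp_time_grid(timesteps=50, eps=0.0):
    """Evenly spaced diffusion times from eps to 1 (assumed grid layout)."""
    if timesteps < 2:
        raise ValueError("timesteps must be at least 2")
    step = (1.0 - eps) / (timesteps - 1)
    return [eps + i * step for i in range(timesteps)]


def vp_alpha_sigma(t, beta_min=0.1, beta_max=8.0):
    """Signal scale and noise std at time t for a linear-beta VP schedule.

    Assumption: beta(t) = beta_min + t * (beta_max - beta_min), so the
    integral of beta from 0 to t is beta_min*t + 0.5*(beta_max-beta_min)*t^2.
    """
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t * t
    alpha = math.exp(-0.5 * integral)             # scale applied to clean data
    sigma = math.sqrt(1.0 - math.exp(-integral))  # std of the added noise
    return alpha, sigma


if __name__ == "__main__":
    # A noised value would be x_t = alpha * x_0 + sigma * z with z ~ N(0, 1).
    for t in vp_time_grid(timesteps=5):
        alpha, sigma = vp_alpha_sigma(t)
        print(f"t={t:.2f}  alpha={alpha:.4f}  sigma={sigma:.4f}")
```

Under this assumed form, alpha squared plus sigma squared equals 1 at every time, which is what "variance-preserving" refers to. A larger beta_max pushes the final timestep closer to pure noise.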
Example
>>> import pandas as pd
>>> from synthyverse.generators import XGB_Diffusion_Generator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["target", "category_col"]
>>>
>>> # Create generator (requires target column)
>>> generator = XGB_Diffusion_Generator(
...     target_column="target",
...     timesteps=50,
...     sampler="ddpm",
...     random_state=42
... )
>>>
>>> # Fit and generate
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)
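The clip_extremes option in the parameter list limits synthesized numerical values to the range observed during training. The following is only an illustrative sketch of that idea for a single numerical column, written in plain Python. The function name clip_to_training_range is hypothetical and does not reflect how the library implements clipping internally.

```python
def clip_to_training_range(train_values, synthetic_values):
    """Clamp each synthetic value to the [min, max] seen in training."""
    lo, hi = min(train_values), max(train_values)
    return [min(max(v, lo), hi) for v in synthetic_values]


if __name__ == "__main__":
    train = [1.0, 5.0, 3.0]
    synthetic = [-2.0, 4.0, 9.0]
    # Values below 1.0 or above 5.0 are pulled back to the nearest bound.
    print(clip_to_training_range(train, synthetic))
```

Clipping of this kind keeps generated values inside the observed range, at the cost of some extra probability mass at the two bounds.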