XGenBoost AR

class synthyverse.generators.xgenboost_generator.XGB_AR_Generator(target_column, conditioning='inference', xgboost_params={'device': 'cpu', 'early_stopping_rounds': 20, 'max_bin': 256, 'max_depth': 3, 'n_estimators': 30}, use_early_stopping=False, temperature=1.0, discretization='quantile', per_bin_sampling='eqf', cat_merge_type='clustering', cat_merge_n_infrequent=5, visit_order_method='naive', visit_order_mode='ascending', random_state=0, n_jobs_xgb=1, n_jobs=-1, H=5, route_method='routing', start_method='bootstrap', **kwargs)

Bases: XGenBoost

XGenBoost autoregressive generator.

Trains a hierarchical autoregressive model where conditionals are learned by XGBoost classifiers.
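For orientation, an autoregressive model over d features, taken in the chosen visit order, factorizes the joint distribution with the chain rule. This is the standard textbook formulation rather than a statement of the library's internals; the symbols x_j and d are notation introduced here for illustration:

```latex
p(x_1, \dots, x_d) = \prod_{j=1}^{d} p\left(x_j \mid x_1, \dots, x_{j-1}\right)
```

Under this reading, each factor on the right-hand side is one conditional of the kind the description says is learned by an XGBoost classifier.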

Parameters:
  • target_column (str) – Name of the target column.

  • conditioning (str) – Conditioning mode. Options: “generation”, “inference”. Default: “inference”.

  • xgboost_params (dict) – Parameters passed to each underlying XGBoost model. Default: {“n_estimators”: 30, “max_depth”: 3, “max_bin”: 256, “early_stopping_rounds”: 20, “device”: “cpu”}.

  • use_early_stopping (bool) – Whether to use validation-based early stopping when validation data is provided. Default: False.

  • temperature (float) – Temperature used for posterior sampling. Default: 1.0.

  • discretization (str) – Numerical discretization strategy. Default: “quantile”.

  • per_bin_sampling (str) – Sampling method within numerical bins. Default: “eqf”.

  • cat_merge_type (str) – Strategy for merging infrequent categories. Default: “clustering”.

  • cat_merge_n_infrequent (int) – Number of clusters into which infrequent categories are merged. Default: 5.

  • visit_order_method (str) – Feature visit-order method. Default: “naive”.

  • visit_order_mode (str) – Visit-order direction. Options: “ascending”, “descending”. Default: “ascending”.

  • random_state (int) – Random seed for reproducibility. Default: 0.

  • n_jobs_xgb (int) – Number of threads used per XGBoost model. Default: 1.

  • n_jobs (int) – Number of parallel jobs used to train/sample across tasks. Default: -1.

  • H (int) – Meta-tree height for numerical features. The number of bins is 2**H (32 at the default). Default: 5.

  • route_method (str) – Numerical routing method. Options: “propagate”, “routing”. Default: “routing”.

  • start_method (str) – Initialization method for the first feature. Options: “bootstrap”, “eqf”. Default: “bootstrap”.

  • **kwargs – Additional arguments passed to TabularBaseGenerator.
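To make the relationship between H and the "quantile" discretization concrete, the sketch below bins a numerical column into 2**H quantile bins with NumPy. It is a minimal illustration under stated assumptions, not the library's implementation: the helper `quantile_bin_edges` and the synthetic data are hypothetical, and the generator's actual binning may differ in detail.

```python
import numpy as np


def quantile_bin_edges(values, H):
    """Return the edges of 2**H quantile bins.

    Hypothetical helper for illustration; not part of synthyverse.
    """
    n_bins = 2 ** H
    # n_bins + 1 evenly spaced quantile levels between 0 and 1
    levels = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(values, levels)


# Synthetic numerical feature (assumption: continuous, no ties)
rng = np.random.default_rng(0)
values = rng.normal(size=10_000)

H = 5  # default meta-tree height, giving 2**5 = 32 bins
edges = quantile_bin_edges(values, H)

# Map each value to a bin index in [0, 2**H - 1] using the interior edges
bin_ids = np.digitize(values, edges[1:-1])

# With quantile edges, every bin holds roughly the same number of rows
counts = np.bincount(bin_ids, minlength=2 ** H)
print("number of bins:", len(edges) - 1)
print("rows per bin (min, max):", counts.min(), counts.max())
```

Because the edges are quantiles, each bin receives a near-equal share of the data, so increasing H by one doubles the number of bins and roughly halves the rows per bin.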

Example

>>> import pandas as pd
>>> from synthyverse.generators import XGB_AR_Generator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["target", "category_col"]
>>>
>>> # Create generator (requires target column)
>>> generator = XGB_AR_Generator(
...     target_column="target",
...     H=5,
...     random_state=42
... )
>>>
>>> # Fit and generate
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)
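
The temperature parameter listed above can be pictured with a common convention for temperature sampling from a categorical distribution: raise each class probability to the power 1/T and renormalize. The sketch below assumes that convention; whether XGB_AR_Generator applies temperature in exactly this way is not stated here, and `apply_temperature` is a hypothetical helper rather than a library function.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector as p_i ** (1 / T), then renormalize.

    Hypothetical helper for illustration; not part of synthyverse.
    """
    # Work in log space for numerical stability; clip to avoid log(0)
    logits = np.log(np.clip(probs, 1e-12, None)) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    return scaled / scaled.sum()


# Example class probabilities, e.g. as predicted by a classifier
probs = np.array([0.7, 0.2, 0.1])

sharp = apply_temperature(probs, 0.5)  # T < 1 concentrates mass on the mode
same = apply_temperature(probs, 1.0)   # T = 1 leaves the distribution unchanged
flat = apply_temperature(probs, 2.0)   # T > 1 moves it toward uniform

# Draw one class index from the flattened distribution
rng = np.random.default_rng(0)
sample = rng.choice(len(probs), p=flat)
```

Under this convention, the default of 1.0 samples from the predicted distribution as-is, lower values make samples more conservative, and higher values make them more diverse.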