Synthpop
- class synthyverse.generators.synthpop_generator.SynthpopGenerator(smoothing=False, proper=False, minibucket=5, tree_params={}, random_state=0, **kwargs)
Bases: TabularBaseGenerator
Adapted from the popular Synthpop R package.
Synthpop uses CART (Classification and Regression Trees) to model the conditional distribution of each variable given the variables synthesized before it. Synthetic data is generated autoregressively, one variable at a time, by sampling observed values from the matching leaf nodes.
Uses the python-synthpop implementation from PyPI.
Paper: “synthpop: Bespoke creation of synthetic data in R” by Nowok et al. (2016).
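The CART sampling idea described above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the python-synthpop implementation: the helper names cart_synthesize_column and cart_synthesize are hypothetical, all columns are assumed numeric, and the columns are synthesized in left-to-right order.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def cart_synthesize_column(X_real, y_real, X_syn, min_samples_leaf=5, rng=None):
    """Hypothetical helper: draw y values for X_syn from matching CART leaves."""
    rng = np.random.default_rng(rng)
    tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, random_state=0)
    tree.fit(X_real, y_real)
    real_leaves = tree.apply(X_real)  # leaf index of every real row
    syn_leaves = tree.apply(X_syn)    # leaf index of every synthetic row
    y_syn = np.empty(len(X_syn), dtype=y_real.dtype)
    for leaf in np.unique(syn_leaves):
        donors = y_real[real_leaves == leaf]  # observed values in this leaf
        mask = syn_leaves == leaf
        y_syn[mask] = rng.choice(donors, size=mask.sum(), replace=True)
    return y_syn


def cart_synthesize(data, n_samples, min_samples_leaf=5, seed=0):
    """Hypothetical helper: synthesize a 2D numeric array column by column."""
    rng = np.random.default_rng(seed)
    n_cols = data.shape[1]
    syn = np.empty((n_samples, n_cols), dtype=data.dtype)
    # The first column has no predictors, so it is resampled from its marginal.
    syn[:, 0] = rng.choice(data[:, 0], size=n_samples, replace=True)
    # Each later column is conditioned on the columns synthesized so far.
    for j in range(1, n_cols):
        syn[:, j] = cart_synthesize_column(
            data[:, :j], data[:, j], syn[:, :j], min_samples_leaf, rng
        )
    return syn


real = np.random.default_rng(1).normal(size=(200, 3))
synthetic = cart_synthesize(real, n_samples=50)
```

Because values are drawn from the donors in each leaf, every synthetic value in this sketch is a value observed in the real data; a larger minimum leaf size widens each donor pool.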
- Parameters:
smoothing (bool) – Whether to use smoothing for continuous variables. Default: False.
proper (bool) – Whether to apply a resampling (proper) step during fitting. Default: False.
minibucket (int) – Minimum number of samples per leaf node. Increase to reduce overfitting. Default: 5.
tree_params (dict) – Dictionary of additional parameters for tree construction (scikit-learn decision trees). Default: {}.
random_state (int) – Random seed for reproducibility. Default: 0.
**kwargs – Additional arguments passed to TabularBaseGenerator.
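As a rough illustration of what a resampling ("proper") step means, the sketch below draws a bootstrap sample of the training rows before any model is fitted. It assumes the step is a plain row-level bootstrap, as in the R package; the helper name bootstrap_rows is hypothetical and the code does not reproduce what SynthpopGenerator does internally.

```python
import pandas as pd


def bootstrap_rows(df, seed=0):
    """Hypothetical helper: resample rows with replacement, keeping the row count."""
    return df.sample(n=len(df), replace=True, random_state=seed).reset_index(drop=True)


real = pd.DataFrame({"age": [23, 35, 41, 58, 62], "income": [21.0, 40.5, 38.2, 55.0, 47.3]})
boot = bootstrap_rows(real, seed=42)
# A tree fitted on `boot` instead of `real` carries some sampling variability
# of the training data into the synthetic draws.
```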
Example
>>> import pandas as pd
>>> from synthyverse.generators import SynthpopGenerator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["category_col"]
>>>
>>> # Create generator
>>> generator = SynthpopGenerator(
...     smoothing=True,
...     proper=True,
...     minibucket=5,
...     random_state=42
... )
>>>
>>> # Fit and generate
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)