Synthpop

class synthyverse.generators.synthpop_generator.SynthpopGenerator(smoothing=False, proper=False, minibucket=5, tree_params={}, random_state=0, **kwargs)

Bases: TabularBaseGenerator

From the popular Synthpop R package.

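Synthpop uses CART (Classification and Regression Trees) to model the conditional distribution of each variable given the variables synthesized before it. Synthetic data is generated autoregressively: each variable is drawn in turn by sampling from the leaf node that a synthetic record falls into.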
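The sequential CART idea can be sketched with scikit-learn alone. This is a minimal illustration for numeric columns only, not the python-synthpop or synthyverse implementation; the function name cart_synthesize and its arguments are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def cart_synthesize(real, n_rows, min_samples_leaf=5, seed=0):
    """Illustrative sequential CART synthesis for numeric columns (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    cols = list(real.columns)
    syn = pd.DataFrame(index=range(n_rows))

    # The first column has no predictors: resample its observed values.
    syn[cols[0]] = rng.choice(real[cols[0]].to_numpy(), size=n_rows, replace=True)

    for j in range(1, len(cols)):
        target, predictors = cols[j], cols[:j]
        tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, random_state=seed)
        tree.fit(real[predictors], real[target])

        # Leaf index of every real row and of every synthetic row built so far.
        real_leaves = tree.apply(real[predictors])
        syn_leaves = tree.apply(syn[predictors])

        # Each leaf keeps the real target values that landed in it ("donors").
        y = real[target].to_numpy()
        donors = {leaf: y[real_leaves == leaf] for leaf in np.unique(real_leaves)}

        # Draw one donor value per synthetic row from its leaf.
        syn[target] = [rng.choice(donors[leaf]) for leaf in syn_leaves]
    return syn


# Toy data: income depends on age.
data_rng = np.random.default_rng(1)
real = pd.DataFrame({"age": data_rng.integers(18, 80, 200).astype(float)})
real["income"] = real["age"] * 500 + data_rng.normal(0, 1000, 200)
syn = cart_synthesize(real, n_rows=50)
```

Because values are drawn from leaf donors, every synthetic income in this sketch is an observed income from a real row with a similar age. Larger leaves widen the donor pool.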

Uses the python-synthpop implementation from PyPI.

Paper: “synthpop: Bespoke creation of synthetic data in R” by Nowok et al. (2016).

Parameters:
  • smoothing (bool) – Whether to use smoothing for continuous variables. Default: False.

  • proper (bool) – Whether to apply a resampling (proper) step during fitting. Default: False.

  • minibucket (int) – Minimum number of samples per leaf node. Increase to reduce overfitting. Default: 5.

  • tree_params (dict) – Dictionary of additional parameters for tree construction (scikit-learn decision trees). Default: {}.

  • random_state (int) – Random seed for reproducibility. Default: 0.

  • **kwargs – Additional arguments passed to TabularBaseGenerator.
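As a rough intuition for the proper and smoothing options, the sketch below shows a bootstrap resample followed by Gaussian kernel jitter on a numeric column. It assumes behaviour similar to the R package (bootstrap before fitting, density smoothing of sampled values). Whether python-synthpop does exactly this is an assumption, and the helpers bootstrap_rows and smooth_values are hypothetical.

```python
import numpy as np


def bootstrap_rows(values, rng):
    """Resample with replacement, same size as the input (assumed 'proper' step)."""
    idx = rng.integers(0, len(values), size=len(values))
    return values[idx]


def smooth_values(values, rng):
    """Add Gaussian noise with a Silverman rule-of-thumb bandwidth (assumed smoothing)."""
    values = np.asarray(values, dtype=float)
    q75, q25 = np.percentile(values, [75, 25])
    sigma = min(values.std(ddof=1), (q75 - q25) / 1.34)
    bandwidth = 0.9 * sigma * len(values) ** (-0.2)
    return values + rng.normal(0.0, bandwidth, size=len(values))


rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=100)
x_boot = bootstrap_rows(x, rng)        # values still come from x
x_smooth = smooth_values(x_boot, rng)  # values no longer restricted to x
```

Without smoothing, sampled continuous values are exact copies of observed values; jittering them trades a little fidelity for less direct reuse of real records.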

Example

>>> import pandas as pd
>>> from synthyverse.generators import SynthpopGenerator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["category_col"]
>>>
>>> # Create generator
>>> generator = SynthpopGenerator(
...     smoothing=True,
...     proper=True,
...     minibucket=5,
...     random_state=42
... )
>>>
>>> # Fit and generate
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)