Processing and Wrappers

class synthyverse.generators.base.DataProcessor(constraints=None, missing_imputation_method='drop', random_state=0)

Bases: object

Reusable tabular pre/postprocessor for synthyverse generators.

The first call to preprocess() fits the processor state. Later calls reuse the fitted imputers, constraints, precision, dtypes, and column order, allowing multiple generators to share one processor for the same dataset.

Use preprocess before fitting a low-level generator, then use postprocess on generated model-space data to restore the original schema. For single-generator workflows, SynthyverseGenerator provides the same behavior as a wrapper.

Parameters:

constraints (list, str) – Optional equality or inequality constraints to enforce in model space and restore after generation. Examples: "total=part_a+part_b", "age>=18", or "income>expenses". Default: None.
missing_imputation_method (str) – Missing-value strategy. Options are "drop", "keep", "mean", "median", "most_frequent", and "missforest". Default: "drop".
random_state (int) – Random seed used by stochastic preprocessing steps. Default: 0.

Example

>>> import pandas as pd
>>> from synthyverse.generators import CTGANGenerator, DataProcessor
>>>
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["category_col"]
>>>
>>> processor = DataProcessor(
...     constraints=["total=part_a+part_b"],
...     missing_imputation_method="median",
...     random_state=42,
... )
>>> X_processed = processor.preprocess(X, discrete_features)
>>>
>>> generator = CTGANGenerator(epochs=300, random_state=42)
>>> generator.fit(X_processed, discrete_features)
>>> X_syn = processor.postprocess(generator.generate(1000))

classmethod load(path)

Load a persisted processor from disk.

Parameters: path (str, Path) – Path to a saved processor.pkl file or a directory containing one.
Returns: Restored processor.
Return type: DataProcessor

postprocess(X)

Transform generated model-space data back to the original schema.

Applies inverse constraints, restores dropped constraint columns, rounds numeric columns to the original precision, restores the original column order, and casts columns back to their original pandas dtypes.

Parameters: X (pd.DataFrame) – Generated data in the processor’s model-space schema.
Returns: Generated data in the original input schema.
Return type: pd.DataFrame

preprocess(X, discrete_features=None, X_val=None)

Fit-if-needed and transform input data for model training.

On the first call, this method records the original schema, fits the missing-value handler, and prepares constraint handling. Later calls reuse that fitted state and only transform data with the same original schema.

Parameters:

X (pd.DataFrame) – Training or later input data in the original schema.
discrete_features (list) – Names of categorical/discrete columns. Required on the first call and optional after the processor is fitted. Default: None.
X_val (pd.DataFrame) – Optional validation data in the same schema as X. Default: None.

Returns:

Processed X when X_val is None, or (X_processed, X_val_processed) when validation data is provided.

Return type:

pd.DataFrame or tuple
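
The fit-if-needed behavior described above can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: FitOnceProcessor, its attributes, and the list-of-dicts row format are hypothetical stand-ins, and only median imputation is modeled.

```python
from statistics import median


class FitOnceProcessor:
    """Hypothetical sketch of a fit-on-first-call preprocessor (not the library class)."""

    def __init__(self):
        self.columns = None      # recorded on the first call
        self.fill_values = None  # per-column medians fitted on the first call

    def preprocess(self, rows):
        columns = list(rows[0].keys())
        if self.columns is None:
            # First call: record the schema and fit the imputation values.
            self.columns = columns
            self.fill_values = {
                col: median(r[col] for r in rows if r[col] is not None)
                for col in columns
            }
        elif columns != self.columns:
            # Later calls only accept data with the original schema.
            raise ValueError(f"expected columns {self.columns}, got {columns}")
        # Every call fills missing entries with the values fitted on the first call.
        return [
            {col: self.fill_values[col] if r[col] is None else r[col] for col in self.columns}
            for r in rows
        ]


train = [
    {"age": 30, "income": 10.0},
    {"age": None, "income": 20.0},
    {"age": 50, "income": 40.0},
]
processor = FitOnceProcessor()
train_out = processor.preprocess(train)                           # fits, then transforms
later_out = processor.preprocess([{"age": None, "income": 5.0}])  # reuses the fitted median
```

The second call fills the missing age with the median learned from the training rows rather than refitting on the new data.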

save(path)

Persist this processor to disk.

path may be either the target file path or a directory. When a directory is provided, the processor is written to processor.pkl inside that directory.

Parameters: path (str, Path) – Target file path or directory.
Return type: None
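
The file-or-directory convention can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, the use of pickle is inferred only from the .pkl extension, and treating a suffix-less path as a directory is a guess about how the two cases might be told apart.

```python
import pickle
import tempfile
from pathlib import Path

DEFAULT_NAME = "processor.pkl"  # file name taken from the documentation above


def resolve_path(path):
    """Return the pickle file path for a file-or-directory argument."""
    path = Path(path)
    # Assumption: an existing directory or a suffix-less path means "directory".
    if path.is_dir() or path.suffix == "":
        return path / DEFAULT_NAME
    return path


def save_state(state, path):
    """Write state to the resolved file, creating parent directories as needed."""
    target = resolve_path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "wb") as fh:
        pickle.dump(state, fh)
    return target


def load_state(path):
    """Read state back from either the file path or its containing directory."""
    with open(resolve_path(path), "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    state = {"column_order": ["age", "score"], "precision": {"score": 2}}
    written_name = save_state(state, tmp).name                     # directory argument
    restored_from_dir = load_state(tmp)                            # load via directory
    restored_from_file = load_state(Path(tmp) / "processor.pkl")   # load via file path
```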

class synthyverse.generators.base.SynthyverseGenerator(generator, generator_params=None, processor=None, constraints=None, missing_imputation_method='drop', random_state=0, **generator_kwargs)

Bases: object

Synthyverse high-level generator wrapper for tabular data.

Combines a low-level BaseGenerator with the shared DataProcessor to provide missing-value handling, constraint handling, dtype restoration, column-order restoration, and numeric precision restoration around any Synthyverse generator.

The wrapped low-level generator and processor remain available through the generator and processor attributes for users who want explicit control over each step.

Parameters:

generator (str, type, BaseGenerator) – Generator name, BaseGenerator subclass, or fitted/unfitted generator instance to wrap.
generator_params (dict) – Keyword arguments used when generator is a name or class. Default: None.
processor (DataProcessor) – Optional preconfigured data processor. If provided, constraints and missing_imputation_method are ignored. Default: None.
constraints (list, str) – Optional equality or inequality constraints applied during preprocessing and reversed during postprocessing. Default: None.
missing_imputation_method (str) – Missing-value strategy used by the created DataProcessor. Options are "drop", "keep", "mean", "median", "most_frequent", and "missforest". Default: "drop".
random_state (int) – Random seed used by the created DataProcessor and by generator classes that accept random_state when no value is supplied in generator_params or generator_kwargs. Default: 0.
**generator_kwargs – Additional keyword arguments passed to the wrapped generator constructor. These override keys in generator_params.
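
The precedence rules in the parameter list (generator_kwargs override generator_params, and random_state is filled in only for classes that accept it) can be sketched like this. The function resolve_generator_kwargs and both toy classes are hypothetical; how the wrapper actually inspects constructors is not documented here, so inspect.signature is an assumption.

```python
import inspect


def resolve_generator_kwargs(generator_cls, generator_params=None, *,
                             default_random_state=0, **generator_kwargs):
    """Merge constructor arguments following the documented precedence."""
    # Start from generator_params, then let explicit keyword arguments win.
    merged = {**(generator_params or {}), **generator_kwargs}
    # Fill in random_state only if the class accepts it and no value was supplied.
    accepted = inspect.signature(generator_cls.__init__).parameters
    if "random_state" in accepted and "random_state" not in merged:
        merged["random_state"] = default_random_state
    return merged


class ToyGenerator:
    """Hypothetical stand-in for a low-level generator that accepts a seed."""

    def __init__(self, epochs=100, batch_size=256, random_state=None):
        self.epochs = epochs
        self.batch_size = batch_size
        self.random_state = random_state


class SeedlessGenerator:
    """Hypothetical stand-in for a generator without a random_state argument."""

    def __init__(self, epochs=100):
        self.epochs = epochs


seeded = resolve_generator_kwargs(
    ToyGenerator,
    {"epochs": 300, "batch_size": 500},
    default_random_state=42,
    epochs=10,  # keyword argument overrides generator_params["epochs"]
)
seedless = resolve_generator_kwargs(SeedlessGenerator, {"epochs": 5}, default_random_state=42)
```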

Example

>>> import pandas as pd
>>> from synthyverse.generators import SynthyverseGenerator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["category_col"]
>>>
>>> # Create high-level wrapper around a low-level generator
>>> generator = SynthyverseGenerator(
...     "ctgan",
...     generator_params={"epochs": 300, "batch_size": 500},
...     missing_imputation_method="median",
...     random_state=42,
... )
>>>
>>> # Fit and generate data in the original schema
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)

fit(X, discrete_features=None, X_val=None)

Fit the high-level generator to tabular data.

Alias for train() for consistency with low-level generators.

Parameters:

X (pd.DataFrame) – Training data in the original tabular schema.
discrete_features (list) – Names of categorical/discrete columns in X. Required when fitting a new processor. Default: None.
X_val (pd.DataFrame) – Optional validation data in the same schema as X. Default: None.

Returns:

The fitted high-level generator.

Return type:

SynthyverseGenerator

generate(n)

Generate synthetic tabular data.

Alias for sample() for consistency with low-level generators.

Parameters: n (int) – Number of synthetic rows to generate.
Returns: Synthetic data with the original columns, dtypes, and numeric precision restored.
Return type: pd.DataFrame

classmethod load(path)

Load a high-level generator wrapper saved with save().

Parameters: path (str, Path) – Directory containing the saved wrapper state.
Returns: Restored high-level generator wrapper.
Return type: SynthyverseGenerator

sample(n)

Generate synthetic rows and restore the original tabular schema.

Parameters: n (int) – Number of synthetic rows to generate.
Returns: Synthetic data with the original columns, dtypes, and numeric precision restored.
Return type: pd.DataFrame

save(path)

Persist the wrapper, processor, and wrapped generator to a directory.

Parameters: path (str, Path) – Directory where the wrapper state should be saved.
Returns: Directory containing the saved wrapper state.
Return type: Path

train(X, discrete_features=None, X_val=None)

Preprocess data and fit the wrapped low-level generator.

Parameters:

X (pd.DataFrame) – Training data in the original tabular schema.
discrete_features (list) – Names of categorical/discrete columns in X. Required when fitting a new processor. Default: None.
X_val (pd.DataFrame) – Optional validation data in the same schema as X. Default: None.

Returns:

The fitted high-level generator.

Return type:

SynthyverseGenerator
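
The way the wrapper composes a processor with a low-level generator, and exposes fit/generate as aliases of train/sample, can be shown with toy stand-ins. All three classes below are hypothetical and operate on plain lists of numbers; they only mirror the call order described in this section.

```python
class ToyProcessor:
    """Hypothetical processor: model space is integer cents, output has 2 decimals."""

    def preprocess(self, values):
        return [round(v * 100) for v in values]

    def postprocess(self, values):
        return [round(v / 100, 2) for v in values]


class ToyGenerator:
    """Hypothetical low-level generator that just repeats the training mean."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def generate(self, n):
        return [self.mean_] * n


class ToyWrapper:
    """Preprocess before fitting, postprocess after generating."""

    def __init__(self, generator, processor):
        self.generator = generator   # stays accessible for explicit control
        self.processor = processor

    def train(self, values):
        self.generator.fit(self.processor.preprocess(values))
        return self                  # returns the fitted wrapper

    def sample(self, n):
        return self.processor.postprocess(self.generator.generate(n))

    # Aliases matching the naming used by low-level generators.
    fit = train
    generate = sample


wrapper = ToyWrapper(ToyGenerator(), ToyProcessor()).fit([1.25, 2.75])
synthetic = wrapper.generate(3)
```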

class synthyverse.generators.base.TabularImputer(method='drop', random_state=0)

Bases: object

Reusable missing-value transformer for tabular generator inputs.

Handles missing numerical values before data is passed to a low-level generator. It supports dropping rows, keeping missing values, simple imputation, and a MissForest-style iterative imputer. Most users get this through DataProcessor; use it directly when you only need missing-value handling without constraints or schema restoration.

Parameters:

method (str) – Missing-value strategy. Options are "drop", "keep", "mean", "median", "most_frequent", and "missforest". Default: "drop".
random_state (int) – Random seed used by stochastic imputers. Default: 0.

Example

>>> import pandas as pd
>>> from synthyverse.generators import TabularImputer
>>>
>>> X = pd.DataFrame({"age": [31, None, 42], "group": ["a", "b", "a"]})
>>> imputer = TabularImputer(method="median", random_state=42)
>>> X_imputed, _ = imputer.fit_transform(X, numerical_features=["age"])
>>> X_later = imputer.transform(X)

fit_transform(X, numerical_features, X_val=None)

Fit the imputer on training data and transform train/validation data.

Parameters:

X (pd.DataFrame) – Training data.
numerical_features (list) – Numeric columns to impute under the simple strategies ("mean", "median", "most_frequent") and to check for missing values under "drop".
X_val (pd.DataFrame) – Optional validation data in the same schema as X. Default: None.

Returns:

(X_processed, X_val_processed). The second item is None when no validation data is provided.

Return type:

tuple
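
A minimal sketch of the fit-on-train, apply-to-both pattern for a single numeric column follows. It is not the library's implementation: the function names are hypothetical, "missforest" is left out, and whether "drop" is also applied to validation data in the library is an assumption made here for symmetry.

```python
from statistics import mean, median, mode

# Simple strategies map to a statistic computed on observed training values.
STRATEGIES = {"mean": mean, "median": median, "most_frequent": mode}


def apply_method(values, method, fill=None):
    """Apply an already-fitted strategy to one column (None marks missing)."""
    if method == "drop":
        return [v for v in values if v is not None]
    if method == "keep":
        return list(values)
    return [fill if v is None else v for v in values]


def fit_transform_column(values, method="drop", val_values=None):
    """Fit on training values, then transform training and validation values."""
    if method not in ("drop", "keep", *STRATEGIES):
        raise ValueError(f"unknown method: {method!r}")
    fill = None
    if method in STRATEGIES:
        # The fill value comes from training data only.
        fill = STRATEGIES[method]([v for v in values if v is not None])
    train_out = apply_method(values, method, fill)
    val_out = None if val_values is None else apply_method(val_values, method, fill)
    return train_out, val_out


train_out, val_out = fit_transform_column([31, None, 42], "median", [None, 50])
dropped, no_val = fit_transform_column([31, None, 42])  # default "drop", no validation data
```

As in the documented return value, the second item is None when no validation data is passed.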

transform(X)

Transform data with the fitted missing-value strategy.

Parameters: X (pd.DataFrame) – Data in the same schema used during fitting.
Returns: Data with missing values handled according to method.
Return type: pd.DataFrame

class synthyverse.generators.base.TabularSchema(column_order, dtypes, precision)

Bases: object

Column order, dtype, and precision contract for tabular data.

Captures the schema of a real pandas DataFrame so generated or transformed data can be restored to the same column order, dtypes, and numeric precision. This is usually managed by DataProcessor, but it can be useful directly when you need to restore model-space data yourself.

Parameters:

column_order (list) – Column names in the desired output order.
dtypes (dict) – Mapping from column names to pandas dtypes.
precision (dict) – Mapping from numeric column names to decimal places.

Example

>>> import pandas as pd
>>> from synthyverse.generators import TabularSchema
>>>
>>> X = pd.DataFrame({"age": [31, 42], "score": [0.25, 0.75]})
>>> schema = TabularSchema.from_dataframe(X)
>>> restored = schema.restore(X[["score", "age"]])
>>> list(restored.columns)
['age', 'score']

classmethod from_dataframe(X, numerical_features=None)

Create a schema contract from a DataFrame.

Parameters:

X (pd.DataFrame) – Data whose columns, dtypes, and numeric precision should be captured.
numerical_features (list) – Optional names of numeric columns to inspect for decimal precision. When omitted, numeric columns are inferred from pandas dtypes.

Returns:

Schema object fitted to X.

Return type:

TabularSchema
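
One way to capture "decimal places" from real values is to count the fractional digits of each value's shortest decimal representation and take the maximum per column. The sketch below does this with the standard decimal module; it is an assumption about what precision means here, not the library's method, and it does not handle NaN or infinite values.

```python
from decimal import Decimal


def decimal_places(value):
    """Number of digits after the decimal point in the shortest repr of value."""
    # normalize() strips trailing zeros, so 2.0 counts as 0 places.
    exponent = Decimal(str(value)).normalize().as_tuple().exponent
    return max(0, -exponent)


def infer_precision(values):
    """Largest number of decimal places observed in a column of finite numbers."""
    return max(decimal_places(v) for v in values)


score_places = infer_precision([0.25, 0.5, 2.0])
age_places = infer_precision([31, 42])
```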

restore(X)

Restore column order, numeric precision, and pandas dtypes.

Parameters: X (pd.DataFrame) – Data containing the columns captured by the schema.
Returns: Restored data with the original schema.
Return type: pd.DataFrame
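
The three restoration steps (column order, rounding, type casting) can be sketched on a plain dict of column lists. This is a simplified illustration, not the library's code: the standalone restore function is hypothetical, and Python callables such as int and float stand in for pandas dtypes.

```python
def restore(columns, column_order, dtypes, precision):
    """Reorder, round, and cast a dict of column lists to a recorded schema."""
    missing = [name for name in column_order if name not in columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    restored = {}
    for name in column_order:                  # 1. original column order
        values = columns[name]
        if name in precision:                  # 2. original numeric precision
            values = [round(v, precision[name]) for v in values]
        cast = dtypes[name]                    # 3. original type
        restored[name] = [cast(v) for v in values]
    return restored


generated = {"score": [0.24871, 0.7512], "age": [30.6, 41.9]}
restored = restore(
    generated,
    column_order=["age", "score"],
    dtypes={"age": int, "score": float},
    precision={"age": 0, "score": 2},
)
```

Rounding before casting matters for integer columns: casting 30.6 directly to int would truncate to 30, while rounding first gives 31.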

round_numeric(X)

Round numeric columns to the precision captured from real data.

Parameters: X (pd.DataFrame) – Data to round.
Returns: Copy of X with known numeric columns rounded.
Return type: pd.DataFrame

validate_columns(X)

Validate that a DataFrame contains every column in the schema.

Parameters: X (pd.DataFrame) – Data to validate.
Raises: ValueError – If one or more required columns are missing.
Return type: None

class synthyverse.generators.base.ConstraintEnforcer(constraints)

Bases: object

Apply simple column constraints before and after generation.

Converts equality and inequality constraints into a model-space representation that is easier for generators to learn. Equalities remove one constrained column before training and reconstruct it after generation. Inequalities store the constrained side as a nonnegative difference and add the expression back during inverse transformation.

This class is used internally by DataProcessor, but you can use it directly when you want explicit control over constraint transformations.

Parameters: constraints (list) – Constraint strings. Equalities use = and inequalities use <, <=, >, or >=. Examples: "total=part_a+part_b", "age>=18", and "income>expenses".

Example

>>> import pandas as pd
>>> from synthyverse.generators import ConstraintEnforcer
>>>
>>> X = pd.DataFrame({"part_a": [2], "part_b": [3], "total": [5]})
>>> enforcer = ConstraintEnforcer(["total=part_a+part_b"])
>>> X_model = enforcer.transform(X)
>>> X_restored = enforcer.inverse_transform(X_model)
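
The equality case (remove the constrained column before training, reconstruct it afterwards) can be sketched in plain Python on a list of row dicts. The helper names are hypothetical, and the parser below only understands sums of columns such as "total=part_a+part_b"; the expression grammar the library accepts is not specified here.

```python
def parse_equality(constraint):
    """Split "total=part_a+part_b" into ("total", ["part_a", "part_b"])."""
    target, expression = constraint.split("=")
    return target.strip(), [term.strip() for term in expression.split("+")]


def drop_constrained(rows, constraint):
    """Model space: remove the column that the equality fully determines."""
    target, _ = parse_equality(constraint)
    return [{k: v for k, v in row.items() if k != target} for row in rows]


def rebuild_constrained(rows, constraint):
    """Original space: recompute the removed column from the remaining terms."""
    target, terms = parse_equality(constraint)
    return [{**row, target: sum(row[t] for t in terms)} for row in rows]


rows = [{"part_a": 2, "part_b": 3, "total": 5}]
model_rows = drop_constrained(rows, "total=part_a+part_b")
restored_rows = rebuild_constrained(model_rows, "total=part_a+part_b")
```

Because the generator never sees total, every generated row satisfies the equality by construction once the column is rebuilt.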

inverse_transform(X)

Restore constrained columns after generation.

Parameters: X (pd.DataFrame) – Generated data in the transformed model-space schema.
Returns: Copy of X with constrained columns reconstructed.
Return type: pd.DataFrame

transform(X)

Transform constrained data into model space.

Parameters: X (pd.DataFrame) – Data containing the columns referenced by the constraints.
Returns: Copy of X with equality columns removed and inequality columns converted to differences.
Return type: pd.DataFrame
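
For inequalities such as "income>expenses" or "age>=18", the difference representation can be sketched as below. The helper names are hypothetical, and clipping negative generated differences to zero is an assumption about how nonnegativity might be enforced; the library may handle this differently, and clipping alone would not guarantee a strict inequality.

```python
def resolve_bound(row, bound):
    """A bound is either a column name (str) or a numeric constant."""
    return row[bound] if isinstance(bound, str) else bound


def to_difference(rows, column, bound):
    """Model space: store column as its gap above the bound."""
    return [{**row, column: row[column] - resolve_bound(row, bound)} for row in rows]


def from_difference(rows, column, bound):
    """Original space: add the bound back to the stored gap."""
    # Assumption: negative generated gaps are clipped to zero.
    return [
        {**row, column: resolve_bound(row, bound) + max(0, row[column])}
        for row in rows
    ]


real = [{"income": 50, "expenses": 30}]
model = to_difference(real, "income", "expenses")       # income becomes the gap 20
restored = from_difference(model, "income", "expenses")

# A constant bound, as in "age>=18": generated gaps of -3 and 7.
generated_ages = from_difference([{"age": -3}, {"age": 7}], "age", 18)
```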