Processing and Wrappers
- class synthyverse.generators.base.DataProcessor(constraints=None, missing_imputation_method='drop', random_state=0)
Bases: object
Reusable tabular pre/postprocessor for synthyverse generators.
The first call to preprocess() fits the processor state. Later calls reuse the fitted imputers, constraints, precision, dtypes, and column order, allowing multiple generators to share one processor for the same dataset.
Use preprocess before fitting a low-level generator, then use postprocess on generated model-space data to restore the original schema. For single-generator workflows, SynthyverseGenerator provides the same behavior as a wrapper.
- Parameters:
constraints (list, str) – Optional equality or inequality constraints to enforce in model space and restore after generation. Examples: "total=part_a+part_b", "age>=18", or "income>expenses". Default: None.
missing_imputation_method (str) – Missing-value strategy. Options are "drop", "keep", "mean", "median", "most_frequent", and "missforest". Default: "drop".
random_state (int) – Random seed used by stochastic preprocessing steps. Default: 0.
Example
>>> import pandas as pd
>>> from synthyverse.generators import CTGANGenerator, DataProcessor
>>>
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["category_col"]
>>>
>>> processor = DataProcessor(
...     constraints=["total=part_a+part_b"],
...     missing_imputation_method="median",
...     random_state=42,
... )
>>> X_processed = processor.preprocess(X, discrete_features)
>>>
>>> generator = CTGANGenerator(epochs=300, random_state=42)
>>> generator.fit(X_processed, discrete_features)
>>> X_syn = processor.postprocess(generator.generate(1000))
- classmethod load(path)
Load a persisted processor from disk.
- Parameters:
path (str, Path) – Path to a saved processor.pkl file or a directory containing one.
- Returns:
Restored processor.
- Return type:
- postprocess(X)
Transform generated model-space data back to the original schema.
Applies inverse constraints, restores dropped constraint columns, rounds numeric columns to the original precision, restores the original column order, and casts columns back to their original pandas dtypes.
- Parameters:
X (pd.DataFrame) – Generated data in the processor’s model-space schema.
- Returns:
Generated data in the original input schema.
- Return type:
pd.DataFrame
- preprocess(X, discrete_features=None, X_val=None)
Fit the processor if needed, then transform input data for model training.
On the first call, this method records the original schema, fits the missing-value handler, and prepares constraint handling. Later calls reuse that fitted state and only transform data with the same original schema.
- Parameters:
X (pd.DataFrame) – Training or later input data in the original schema.
discrete_features (list) – Names of categorical/discrete columns. Required on the first call and optional after the processor is fitted. Default: None.
X_val (pd.DataFrame) – Optional validation data in the same schema as X. Default: None.
- Returns:
Processed X when X_val is None, or (X_processed, X_val_processed) when validation data is provided.
- Return type:
pd.DataFrame or tuple
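The fit-on-first-call contract described for preprocess() can be pictured with a small stand-in class. This is a minimal sketch, not the library's implementation: `FitOnceProcessor` is a hypothetical name, a plain dict of column lists stands in for a DataFrame, and the strict column-order comparison is an assumption about what "same original schema" means.

```python
class FitOnceProcessor:
    """Hypothetical stand-in that mimics the fit-on-first-call contract."""

    def __init__(self):
        self.column_order = None       # recorded on the first call
        self.discrete_features = None  # recorded on the first call

    def preprocess(self, table, discrete_features=None):
        if self.column_order is None:
            # First call: fit the processor state.
            if discrete_features is None:
                raise ValueError("discrete_features is required on the first call")
            self.column_order = list(table)
            self.discrete_features = list(discrete_features)
        elif list(table) != self.column_order:
            # Later calls: only data in the fitted schema is accepted (assumed rule).
            raise ValueError("input does not match the fitted schema")
        # A real processor would impute and apply constraints here; this copies only.
        return {name: list(values) for name, values in table.items()}


processor = FitOnceProcessor()
processor.preprocess({"a": [1, 2], "b": ["x", "y"]}, discrete_features=["b"])
later = processor.preprocess({"a": [3], "b": ["z"]})  # reuses the fitted state
```

Because the state lives on the processor rather than on a generator, the same fitted object can be handed to several generators trained on the same dataset.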
- save(path)
Persist this processor to disk.
path may be either the target file path or a directory. When a directory is provided, the processor is written to processor.pkl.
- Return type:
None
- Parameters:
path (str | Path)
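The file-or-directory behavior of save() and load() might look roughly like the sketch below. It makes two assumptions that the text above does not confirm: that the state is serialized with pickle (suggested only by the .pkl extension), and that "a directory" means a path that already exists as a directory. `save_state` and `load_state` are hypothetical helper names.

```python
import pickle
from pathlib import Path


def _resolve(path):
    """Map a directory to <dir>/processor.pkl and leave file paths unchanged."""
    path = Path(path)
    # Assumption: only an existing directory counts as "a directory".
    return path / "processor.pkl" if path.is_dir() else path


def save_state(state, path):
    """Write a picklable state object to the resolved target file."""
    target = _resolve(path)
    with open(target, "wb") as handle:
        pickle.dump(state, handle)
    return target


def load_state(path):
    """Read the state back from a file path or a directory containing one."""
    with open(_resolve(path), "rb") as handle:
        return pickle.load(handle)
```

As with any pickle-based format, such files should only be loaded from trusted sources.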
- class synthyverse.generators.base.SynthyverseGenerator(generator, generator_params=None, processor=None, constraints=None, missing_imputation_method='drop', random_state=0, **generator_kwargs)
Bases: object
Synthyverse high-level generator wrapper for tabular data.
Combines a low-level BaseGenerator with the shared DataProcessor to provide missing-value handling, constraint handling, dtype restoration, column-order restoration, and numeric precision restoration around any Synthyverse generator.
The wrapped low-level generator and processor remain available through the generator and processor attributes for users who want explicit control over each step.
- Parameters:
generator (str, type, BaseGenerator) – Generator name, BaseGenerator subclass, or fitted/unfitted generator instance to wrap.
generator_params (dict) – Keyword arguments used when generator is a name or class. Default: None.
processor (DataProcessor) – Optional preconfigured data processor. If provided, constraints and missing_imputation_method are ignored. Default: None.
constraints (list, str) – Optional equality or inequality constraints applied during preprocessing and reversed during postprocessing. Default: None.
missing_imputation_method (str) – Missing-value strategy used by the created DataProcessor. Options: "drop", "keep", "mean", "median", "most_frequent", "missforest". Default: "drop".
random_state (int) – Random seed used by the created DataProcessor and by generator classes that accept random_state when no value is supplied in generator_params or generator_kwargs. Default: 0.
**generator_kwargs – Additional keyword arguments passed to the wrapped generator constructor. These override keys in generator_params.
Example
>>> import pandas as pd
>>> from synthyverse.generators import SynthyverseGenerator
>>>
>>> # Load data
>>> X = pd.read_csv("data.csv")
>>> discrete_features = ["category_col"]
>>>
>>> # Create high-level wrapper around a low-level generator
>>> generator = SynthyverseGenerator(
...     "ctgan",
...     generator_params={"epochs": 300, "batch_size": 500},
...     missing_imputation_method="median",
...     random_state=42,
... )
>>>
>>> # Fit and generate data in the original schema
>>> generator.fit(X, discrete_features)
>>> X_syn = generator.generate(1000)
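The precedence rules stated for generator_params, **generator_kwargs, and random_state can be sketched in plain Python. This only illustrates the documented ordering and is not the wrapper's actual code; `resolve_generator_kwargs` and its `accepts_random_state` flag are hypothetical names introduced here.

```python
def resolve_generator_kwargs(generator_params=None, random_state=0,
                             accepts_random_state=True, **generator_kwargs):
    """Merge constructor arguments in the documented order (illustrative only)."""
    # Start from generator_params, then let explicit keyword arguments override them.
    merged = {**(generator_params or {}), **generator_kwargs}
    # The wrapper-level seed is only a fallback for generators that accept one.
    if accepts_random_state and "random_state" not in merged:
        merged["random_state"] = random_state
    return merged


print(resolve_generator_kwargs({"epochs": 300, "batch_size": 500},
                               random_state=42, epochs=10))
# prints {'epochs': 10, 'batch_size': 500, 'random_state': 42}
```

A seed placed in generator_params is left untouched, so the wrapper-level random_state never silently replaces a value the caller chose for the generator.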
- fit(X, discrete_features=None, X_val=None)
Fit the high-level generator to tabular data.
Alias for train() for consistency with low-level generators.
- Parameters:
X (pd.DataFrame) – Training data in the original tabular schema.
discrete_features (list) – Names of categorical/discrete columns in X. Required when fitting a new processor. Default: None.
X_val (pd.DataFrame) – Optional validation data in the same schema as X. Default: None.
- Returns:
The fitted high-level generator.
- Return type:
- generate(n)
Generate synthetic tabular data.
Alias for sample() for consistency with low-level generators.
- Parameters:
n (int) – Number of synthetic rows to generate.
- Returns:
Synthetic data with the original columns, dtypes, and numeric precision restored.
- Return type:
pd.DataFrame
- classmethod load(path)
Load a high-level generator wrapper saved with save().
- Parameters:
path (str, Path) – Directory containing the saved wrapper state.
- Returns:
Restored high-level generator wrapper.
- Return type:
- sample(n)
Generate synthetic rows and restore the original tabular schema.
- Parameters:
n (int) – Number of synthetic rows to generate.
- Returns:
Synthetic data with the original columns, dtypes, and numeric precision restored.
- Return type:
pd.DataFrame
- save(path)
Persist the wrapper, processor, and wrapped generator to a directory.
- Parameters:
path (str, Path) – Directory where the wrapper state should be saved.
- Returns:
Directory containing the saved wrapper state.
- Return type:
Path
- train(X, discrete_features=None, X_val=None)
Preprocess data and fit the wrapped low-level generator.
- Parameters:
X (pd.DataFrame) – Training data in the original tabular schema.
discrete_features (list) – Names of categorical/discrete columns in X. Required when fitting a new processor. Default: None.
X_val (pd.DataFrame) – Optional validation data in the same schema as X. Default: None.
- Returns:
The fitted high-level generator.
- Return type:
- class synthyverse.generators.base.TabularImputer(method='drop', random_state=0)
Bases: object
Reusable missing-value transformer for tabular generator inputs.
Handles missing numerical values before data is passed to a low-level generator. It supports dropping rows, keeping missing values, simple imputation, and a MissForest-style iterative imputer. Most users get this through DataProcessor; use it directly when you only need missing-value handling without constraints or schema restoration.
- Parameters:
method (str) – Missing-value strategy. Options are "drop", "keep", "mean", "median", "most_frequent", and "missforest". Default: "drop".
random_state (int) – Random seed used by stochastic imputers. Default: 0.
Example
>>> import pandas as pd
>>> from synthyverse.generators import TabularImputer
>>>
>>> X = pd.DataFrame({"age": [31, None, 42], "group": ["a", "b", "a"]})
>>> imputer = TabularImputer(method="median", random_state=42)
>>> X_imputed, _ = imputer.fit_transform(X, numerical_features=["age"])
>>> X_later = imputer.transform(X)
- fit_transform(X, numerical_features, X_val=None)
Fit the imputer on training data and transform train/validation data.
- Parameters:
X (pd.DataFrame) – Training data.
numerical_features (list) – Numeric columns to impute for simple strategies and to inspect for "drop".
X_val (pd.DataFrame) – Optional validation data in the same schema as X. Default: None.
- Returns:
(X_processed, X_val_processed). The second item is None when no validation data is provided.
- Return type:
tuple
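For the "drop" strategy, the return contract above (a pair whose second item is None without validation data) can be illustrated with plain Python rows. This is a sketch under assumptions, not the library's code: `drop_missing` and `fit_transform_drop` are hypothetical names, each row is a dict, and None stands in for a missing value.

```python
def drop_missing(rows, numerical_features):
    """Keep only rows with no missing value in the inspected numeric columns."""
    return [row for row in rows
            if all(row[name] is not None for name in numerical_features)]


def fit_transform_drop(train_rows, numerical_features, val_rows=None):
    """Return (train, val); val is None when no validation rows are given."""
    train_out = drop_missing(train_rows, numerical_features)
    val_out = None if val_rows is None else drop_missing(val_rows, numerical_features)
    return train_out, val_out


train = [{"age": 31, "group": "a"}, {"age": None, "group": "b"}, {"age": 42, "group": "a"}]
kept, kept_val = fit_transform_drop(train, numerical_features=["age"])
```

Here only the columns named in numerical_features are inspected, so a missing value in a categorical column would not cause a row to be dropped in this sketch.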
- transform(X)
Transform data with the fitted missing-value strategy.
- Parameters:
X (pd.DataFrame) – Data in the same schema used during fitting.
- Returns:
Data with missing values handled according to method.
- Return type:
pd.DataFrame
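The split between fitting and transforming matters most for the simple imputers: the fill value is learned once from training data and then reused on later data. The sketch below shows that idea for the median only; `MedianImputerSketch` is a hypothetical class, rows are plain dicts, and None stands in for a missing value.

```python
from statistics import median


class MedianImputerSketch:
    """Hypothetical stand-in: learn medians once, reuse them on later data."""

    def fit(self, rows, numerical_features):
        # Compute one fill value per numeric column from the observed training values.
        self.fill_values = {
            name: median(row[name] for row in rows if row[name] is not None)
            for name in numerical_features
        }
        return self

    def transform(self, rows):
        # Replace missing values with the stored medians; other columns pass through.
        return [
            {name: self.fill_values.get(name, value) if value is None else value
             for name, value in row.items()}
            for row in rows
        ]


train = [{"age": 31, "group": "a"}, {"age": None, "group": "b"}, {"age": 42, "group": "a"}]
imputer = MedianImputerSketch().fit(train, numerical_features=["age"])
later = imputer.transform([{"age": None, "group": "c"}])
```

The later row receives the training median rather than a statistic computed from the new data, which keeps repeated transform calls consistent with each other.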
- class synthyverse.generators.base.TabularSchema(column_order, dtypes, precision)
Bases: object
Column order, dtype, and precision contract for tabular data.
Captures the schema of a real pandas DataFrame so generated or transformed data can be restored to the same column order, dtypes, and numeric precision. This is usually managed by DataProcessor, but it can be useful directly when you need to restore model-space data yourself.
- Parameters:
column_order (list) – Column names in the desired output order.
dtypes (dict) – Mapping from column names to pandas dtypes.
precision (dict) – Mapping from numeric column names to decimal places.
Example
>>> import pandas as pd
>>> from synthyverse.generators import TabularSchema
>>>
>>> X = pd.DataFrame({"age": [31, 42], "score": [0.25, 0.75]})
>>> schema = TabularSchema.from_dataframe(X)
>>> restored = schema.restore(X[["score", "age"]])
>>> list(restored.columns)
['age', 'score']
- classmethod from_dataframe(X, numerical_features=None)
Create a schema contract from a DataFrame.
- Parameters:
X (pd.DataFrame) – Data whose columns, dtypes, and numeric precision should be captured.
numerical_features (list) – Optional names of numeric columns to inspect for decimal precision. When omitted, numeric columns are inferred from pandas dtypes.
- Returns:
Schema object fitted to X.
- Return type:
- restore(X)
Restore column order, numeric precision, and pandas dtypes.
- Parameters:
X (pd.DataFrame) – Data containing the columns captured by the schema.
- Returns:
Restored data with the original schema.
- Return type:
pd.DataFrame
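Restoring column order and dtypes amounts to reordering by the captured column list and casting each column back. The sketch below illustrates only those two steps: `restore_sketch` is a hypothetical function, a dict of column lists stands in for a DataFrame, and Python callables such as int and float stand in for pandas dtypes.

```python
def restore_sketch(table, column_order, dtypes):
    """Reorder columns to column_order and cast every value with its column's type."""
    return {
        name: [dtypes[name](value) for value in table[name]]
        for name in column_order
    }


# Model-space output: columns shuffled and the integer column emitted as floats.
generated = {"score": [0.25, 0.75], "age": [31.0, 42.0]}
restored = restore_sketch(generated, ["age", "score"], {"age": int, "score": float})
print(list(restored))  # prints ['age', 'score']
```

Casting with int truncates toward zero, so a real implementation would round before casting integer columns.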
- round_numeric(X)
Round numeric columns to the precision captured from real data.
- Parameters:
X (pd.DataFrame) – Data to round.
- Returns:
Copy of X with known numeric columns rounded.
- Return type:
pd.DataFrame
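One plausible way to capture precision from real data and apply it to generated values is shown below. How the library actually infers decimal places is not stated here, so treat this as an assumption: `decimal_places` and `round_numeric_sketch` are hypothetical names, and the inputs are assumed to contain no missing values.

```python
from decimal import Decimal


def decimal_places(values):
    """Largest number of decimal places among the values (trailing zeros ignored)."""
    places = 0
    for value in values:
        exponent = Decimal(str(value)).normalize().as_tuple().exponent
        places = max(places, -exponent if exponent < 0 else 0)
    return places


def round_numeric_sketch(table, precision):
    """Return a copy of table with the columns listed in precision rounded."""
    return {
        name: [round(v, precision[name]) for v in values] if name in precision else list(values)
        for name, values in table.items()
    }


precision = {"score": decimal_places([0.25, 0.75])}
rounded = round_numeric_sketch({"score": [0.123456, 0.987654], "group": ["a", "b"]}, precision)
```

Columns without a recorded precision, such as the categorical column here, are copied unchanged.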
- validate_columns(X)
Validate that a DataFrame contains every column in the schema.
- Parameters:
X (pd.DataFrame) – Data to validate.
- Raises:
ValueError – If one or more required columns are missing.
- Return type:
None
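The validation contract (return None on success, raise ValueError when schema columns are missing) is simple enough to sketch directly. `validate_columns_sketch` is a hypothetical function and the error message wording is an assumption.

```python
def validate_columns_sketch(columns, schema_columns):
    """Raise ValueError if any schema column is absent; extra columns are allowed."""
    missing = [name for name in schema_columns if name not in columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


validate_columns_sketch(["age", "score", "extra"], ["age", "score"])  # passes silently
try:
    validate_columns_sketch(["age"], ["age", "score"])
except ValueError as error:
    message = str(error)
```

Collecting every missing name before raising gives the caller one complete error instead of one failure per column.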
- class synthyverse.generators.base.ConstraintEnforcer(constraints)
Bases: object
Apply simple column constraints before and after generation.
Converts equality and inequality constraints into a model-space representation that is easier for generators to learn. Equalities remove one constrained column before training and reconstruct it after generation. Inequalities store the constrained side as a nonnegative difference and add the expression back during inverse transformation.
This class is used internally by DataProcessor, but you can use it directly when you want explicit control over constraint transformations.
- Parameters:
constraints (list) – Constraint strings. Equalities use = and inequalities use <, <=, >, or >=. Examples: "total=part_a+part_b", "age>=18", and "income>expenses".
Example
>>> import pandas as pd
>>> from synthyverse.generators import ConstraintEnforcer
>>>
>>> X = pd.DataFrame({"part_a": [2], "part_b": [3], "total": [5]})
>>> enforcer = ConstraintEnforcer(["total=part_a+part_b"])
>>> X_model = enforcer.transform(X)
>>> X_restored = enforcer.inverse_transform(X_model)
- inverse_transform(X)
Restore constrained columns after generation.
- Parameters:
X (pd.DataFrame) – Generated data in the transformed model-space schema.
- Returns:
Copy of X with constrained columns reconstructed.
- Return type:
pd.DataFrame
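For an equality such as "total=part_a+part_b", reconstruction means recomputing the dropped column from the columns that remain. The sketch below hard-codes that single constraint instead of parsing constraint strings; `transform_equality` and `inverse_equality` are hypothetical names and rows are plain dicts.

```python
def transform_equality(rows):
    """Drop `total`, which is fully determined by part_a + part_b."""
    return [{key: value for key, value in row.items() if key != "total"} for row in rows]


def inverse_equality(rows):
    """Recompute `total` so the equality holds exactly for every row."""
    return [{**row, "total": row["part_a"] + row["part_b"]} for row in rows]


model_rows = transform_equality([{"part_a": 2, "part_b": 3, "total": 5}])
restored_rows = inverse_equality(model_rows)
```

Since the generator never sees the dropped column, every generated row satisfies the equality by construction once the inverse step runs.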
- transform(X)
Transform constrained data into model space.
- Parameters:
X (pd.DataFrame) – Data containing the columns referenced by the constraints.
- Returns:
Copy of X with equality columns removed and inequality columns converted to differences.
- Return type:
pd.DataFrame
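For an inequality such as "income>expenses", converting to a difference might work as in the sketch below. Several details are assumptions rather than documented behavior: that the left-hand column is the one replaced, that generated differences are clamped at zero, and that the strict and non-strict forms are treated alike. `transform_inequality` and `inverse_inequality` are hypothetical names.

```python
def transform_inequality(rows):
    """Replace income with the gap income - expenses, nonnegative for valid rows."""
    return [{**row, "income": row["income"] - row["expenses"]} for row in rows]


def inverse_inequality(rows):
    """Add expenses back; clamping at zero (an assumption) keeps income >= expenses."""
    return [{**row, "income": max(row["income"], 0) + row["expenses"]} for row in rows]


model_rows = transform_inequality([{"income": 50, "expenses": 30}])
restored_rows = inverse_inequality(model_rows)

# A generated row with a slightly negative gap still maps back to a valid row.
repaired_rows = inverse_inequality([{"income": -5, "expenses": 30}])
```

With clamping, a negative generated gap maps to income equal to expenses, so a strict inequality would need an additional small offset that this sketch does not attempt.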