DCR

class synthyverse.evaluation.privacy.DCR(discrete_features=None, subsample_test_size=True, random_state=0)

Bases: object

Registry name: dcr

Distance to Closest Record (DCR) privacy metrics.

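Measures whether synthetic records tend to lie closer to a training record than to an independent test record. A generator whose output sits systematically closer to the training data than to held-out data may have memorized training records.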
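The comparison described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: it assumes purely numeric features and Euclidean distance, and the helper names `nearest_distance` and `share_closer_to_train` are hypothetical.

```python
import numpy as np

def nearest_distance(queries, reference):
    """Euclidean distance from each query row to its closest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def share_closer_to_train(X_train, X_test, X_syn):
    """Fraction of synthetic rows that are strictly closer to train than to test."""
    d_train = nearest_distance(X_syn, X_train)
    d_test = nearest_distance(X_syn, X_test)
    return float(np.mean(d_train < d_test))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_test = rng.normal(size=(200, 3))

# A "generator" that simply copies training rows (worst case for privacy).
X_copy = X_train[:50].copy()
# Fresh draws from the same distribution (no memorization).
X_fresh = rng.normal(size=(50, 3))

share_copy = share_closer_to_train(X_train, X_test, X_copy)
share_fresh = share_closer_to_train(X_train, X_test, X_fresh)
```

Copied rows have distance zero to the training set, so every one of them counts as closer to train. Fresh draws have no systematic preference, so their share should sit well below that.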

Parameters:

discrete_features (list) – List of discrete/categorical feature names. Default: None (treated as an empty list).
subsample_test_size (bool) – Whether to subsample the training and synthetic sets to the size of the test set, which prevents differing sample sizes from biasing DCR. When enabled, the DCR score is computed over multiple iterations and aggregated, so that the metric is based on all training and synthetic records. Default: True.
random_state (int) – Random seed for reproducibility. Default: 0.
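One way to picture the subsampling behind `subsample_test_size` is to shuffle the row indices and walk through them in chunks of the test set size, so that every row is used exactly once across iterations. The sketch below is an assumption about the general idea only. The helper `iter_chunks` is hypothetical, and how the library handles a final, smaller chunk or aggregates per-iteration scores is not stated in this documentation.

```python
import numpy as np

def iter_chunks(n_rows, chunk_size, rng):
    """Yield shuffled index chunks of at most chunk_size that cover every row once."""
    order = rng.permutation(n_rows)
    for start in range(0, n_rows, chunk_size):
        yield order[start:start + chunk_size]

rng = np.random.default_rng(0)  # plays the role of random_state
n_train, n_test = 10, 4

chunks = list(iter_chunks(n_train, n_test, rng))
chunk_sizes = [len(c) for c in chunks]

# Every training row index appears in exactly one chunk.
covered = sorted(np.concatenate(chunks).tolist())
```

With 10 training rows and a test set of 4, this yields chunks of 4, 4, and 2 rows. A per-chunk score could then be averaged across chunks.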

Example

>>> import pandas as pd
>>> from synthyverse.evaluation import DCR
>>>
>>> # Prepare data
>>> X_train = pd.DataFrame(...)
>>> X_test = pd.DataFrame(...)
>>> X_syn = pd.DataFrame(...)
>>> discrete_features = ["category_col"]
>>>
>>> # Create metric
>>> metric = DCR(
...     discrete_features=discrete_features,
...     subsample_test_size=True
... )
>>>
>>> # Evaluate
>>> results = metric.evaluate(X_train, X_test, X_syn)

evaluate(X_train, X_test, X_syn)

Evaluate synthetic data privacy using DCR metrics.

Parameters:

X_train (DataFrame) – Training data as a pandas DataFrame.
X_test (DataFrame) – Test data as a pandas DataFrame.
X_syn (DataFrame) – Synthetic data as a pandas DataFrame.

Returns:

Dictionary with keys:

"dcr.score": DCR score; higher values indicate better privacy
"dcr.train": Proportion of synthetic records closer to a training record than to a test record
"dcr.test": Proportion of synthetic records closer to a test record than to a training record
"dcr.quantile_002": Proportion of synthetic records closer to a training record than the 2% quantile of test-to-train distances
"dcr.quantile_005": Proportion of synthetic records closer to a training record than the 5% quantile of test-to-train distances
"dcr.nndr_train": Mean NNDR score from synthetic records to training records
"dcr.nndr_train_002": 2% quantile of NNDR scores from synthetic records to training records
"dcr.nndr_train_005": 5% quantile of NNDR scores from synthetic records to training records
"dcr.nndr_test": Mean NNDR score from synthetic records to test records
"dcr.nndr_test_002": 2% quantile of NNDR scores from synthetic records to test records
"dcr.nndr_test_005": 5% quantile of NNDR scores from synthetic records to test records
"dcr.nndr_ratio": Mean pointwise ratio of each synthetic row's train NNDR score to its test NNDR score
"dcr.nndr_ratio_002": 2% quantile of the pointwise train/test NNDR ratios
"dcr.nndr_ratio_005": 5% quantile of the pointwise train/test NNDR ratios
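The quantile and NNDR entries can be illustrated with a small NumPy sketch. It assumes the common definition of NNDR (nearest neighbor distance ratio) as the distance to the closest reference record divided by the distance to the second closest, with Euclidean distance on numeric features. The library's exact distance metric and edge-case handling are not documented here, and the helper names are hypothetical.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance matrix between rows of a and rows of b."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2))

def share_below_test_quantile(X_train, X_test, X_syn, q):
    """Share of synthetic rows whose nearest-train distance falls below
    the q-quantile of the test rows' nearest-train distances."""
    threshold = np.quantile(pairwise_distances(X_test, X_train).min(axis=1), q)
    syn_to_train = pairwise_distances(X_syn, X_train).min(axis=1)
    return float(np.mean(syn_to_train < threshold))

def nndr(queries, reference):
    """Ratio of nearest to second-nearest reference distance for each query row."""
    two_smallest = np.sort(pairwise_distances(queries, reference), axis=1)[:, :2]
    nearest, second = two_smallest[:, 0], two_smallest[:, 1]
    # Return 0 where the second-nearest distance is 0 to avoid dividing by zero.
    return np.divide(nearest, second, out=np.zeros_like(nearest), where=second > 0)

# Toy case: the query is 1 away from its nearest point and 2 from the next one.
reference = np.array([[0.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
query = np.array([[1.0, 0.0]])
toy_nndr = nndr(query, reference)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_test = rng.normal(size=(200, 3))
X_copy = X_train[:50].copy()  # synthetic rows copied from train

copy_share_002 = share_below_test_quantile(X_train, X_test, X_copy, 0.02)
copy_nndr_mean = float(nndr(X_copy, X_train).mean())
fresh_scores = nndr(rng.normal(size=(50, 3)), X_train)
q02, q05 = np.quantile(fresh_scores, [0.02, 0.05])
```

Under this definition NNDR lies between 0 and 1, and values near 0 mean a synthetic row sits unusually close to a single reference record. Copied training rows therefore give an NNDR of 0 and fall entirely below the test-to-train quantile threshold.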

Return type:

dict

Raises:

AssertionError – If the test set is larger than the training set.