Wasserstein

class synthyverse.evaluation.fidelity.Wasserstein(discrete_features=None, blur=0.001, scaling=0.5, debias=True, backend='online')

Bases: object

Registry name: wasserstein

Multivariate Wasserstein distance between real and synthetic samples.

Sinkhorn approximation of the Wasserstein-1 distance using a Gower-like cost function.
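
To make "Gower-like cost" concrete, the following is a minimal NumPy sketch of one common Gower-style construction: range-normalized absolute differences for numeric features, a 0/1 mismatch for discrete features, averaged over all features. The function name `gower_like_cost`, the pooled min/max normalization, and the equal feature weighting are assumptions for illustration; the exact cost used by this class may differ.

```python
import numpy as np

def gower_like_cost(x_num, y_num, x_cat, y_cat):
    """Pairwise Gower-style cost between the rows of two samples.

    x_num, y_num: float arrays of shape (n, d_num) and (m, d_num).
    x_cat, y_cat: integer-coded arrays of shape (n, d_cat) and (m, d_cat).
    Returns an (n, m) cost matrix with entries in [0, 1].
    """
    # Assumption: numeric columns are scaled by their pooled range.
    pooled = np.vstack([x_num, y_num]).astype(float)
    span = pooled.max(axis=0) - pooled.min(axis=0)
    span[span == 0] = 1.0  # constant columns contribute zero cost
    num_cost = np.abs(x_num[:, None, :] - y_num[None, :, :]) / span
    # Discrete columns cost 1 on mismatch and 0 on match.
    cat_cost = (x_cat[:, None, :] != y_cat[None, :, :]).astype(float)
    # Assumption: every feature carries equal weight.
    n_features = x_num.shape[1] + x_cat.shape[1]
    return (num_cost.sum(axis=2) + cat_cost.sum(axis=2)) / n_features

# Tiny illustrative inputs: two numeric columns and one discrete column.
real_num = np.array([[0.0, 10.0], [1.0, 20.0]])
real_cat = np.array([[0], [1]])
syn_num = np.array([[0.0, 10.0], [0.5, 20.0]])
syn_cat = np.array([[0], [0]])
cost = gower_like_cost(real_num, syn_num, real_cat, syn_cat)
```

Because each per-feature term is an absolute difference or a mismatch indicator, a cost of this shape is an L1-type ground cost.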

Parameters:
  • discrete_features (list) – List of discrete/categorical feature names. Default: None, which is treated as an empty list.

  • blur (float) – Entropic regularization scale passed to GeomLoss SamplesLoss. Smaller values are closer to exact optimal transport but can be slower or less stable. Default: 0.001.

  • scaling (float) – GeomLoss epsilon-scaling ratio. Default: 0.5.

  • debias (bool) – Whether to use the debiased Sinkhorn divergence form, which returns zero for identical empirical distributions. Default: True.

  • backend (str) – GeomLoss backend. The default “online” backend streams pairwise costs through KeOps and avoids materializing the full sample-by-sample cost matrix. Use “tensorized” only for small datasets or debugging. Default: “online”.
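
The effect of debias can be illustrated with a small, self-contained NumPy sketch of a debiased Sinkhorn divergence, S(x, y) = OT(x, y) - 0.5 OT(x, x) - 0.5 OT(y, y), on uniformly weighted samples with a plain L1 ground cost. This is not the GeomLoss implementation: the names `entropic_ot` and `sinkhorn_divergence`, the `eps` parameter (standing in for the entropic regularization strength), and the fixed iteration count are assumptions, and epsilon-scaling and the KeOps backend are not modeled.

```python
import numpy as np

def _logsumexp(m, axis):
    # Numerically stable log-sum-exp along one axis.
    top = m.max(axis=axis, keepdims=True)
    return np.squeeze(top, axis=axis) + np.log(np.exp(m - top).sum(axis=axis))

def entropic_ot(x, y, eps, n_iter=200):
    """Entropic OT value between uniform empirical measures (L1 ground cost)."""
    cost = np.abs(x[:, None, :] - y[None, :, :]).sum(axis=2)
    log_a = np.full(len(x), -np.log(len(x)))
    log_b = np.full(len(y), -np.log(len(y)))
    f = np.zeros(len(x))
    g = np.zeros(len(y))
    for _ in range(n_iter):
        # Alternating log-domain Sinkhorn updates of the dual potentials.
        f = -eps * _logsumexp(log_b[None, :] + (g[None, :] - cost) / eps, axis=1)
        g = -eps * _logsumexp(log_a[:, None] + (f[:, None] - cost) / eps, axis=0)
    # Dual objective <a, f> + <b, g>; weights are uniform, so means suffice.
    return f.mean() + g.mean()

def sinkhorn_divergence(x, y, eps=0.05):
    # Subtracting the two self-terms removes the entropic bias.
    return (entropic_ot(x, y, eps)
            - 0.5 * entropic_ot(x, x, eps)
            - 0.5 * entropic_ot(y, y, eps))

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = x + 1.0  # same shape, shifted by one unit in each coordinate
same = sinkhorn_divergence(x, x)
shifted = sinkhorn_divergence(x, y)
```

In this sketch `same` is zero because the self-terms cancel the cross term exactly, while `shifted` is clearly positive. Without the self-terms, the entropic value for identical samples is generally nonzero.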

Example

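>>> import pandas as pd
>>> from synthyverse.evaluation import Wasserstein
>>>
>>> # X_train and X_syn are pandas DataFrames of real and synthetic rows.
>>> metric = Wasserstein(discrete_features=["category_col"])
>>> results = metric.evaluate(X_train, X_syn)
>>> w1 = results["wasserstein.w1"]

evaluate(X_train, X_syn)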

Evaluate synthetic data using multivariate Wasserstein distance.

Parameters:
  • X_train (DataFrame) – Real training data as a pandas DataFrame.

  • X_syn (DataFrame) – Synthetic data as a pandas DataFrame.

Returns:

Dictionary with key:
  • “wasserstein.w1”: Wasserstein-1 distance with L1 ground cost

Return type:

dict