Tags: python, pytorch, data-analysis

Problem with library for quick Wasserstein distance calculation


I need a tool to quickly calculate the Wasserstein distance between two two-dimensional point sets. I have been using Gudhi, but it is too slow for my purposes, so I am looking for a faster alternative. I found the geomloss library, which appears to be fast enough, but its results differ. For example,

from gudhi.wasserstein import wasserstein_distance
import numpy as np

dgm1 = np.array([[2.7, 3.7],[9.6, 14.],[34.2, 34.974]])
dgm2 = np.array([[2.8, 4.45],[9.5, 14.1]])

wasserstein_distance(dgm1, dgm2, order=1)

yields 1.2369999999999965, while

import torch
from geomloss import SamplesLoss

I1 = torch.Tensor(dgm1)
I2 = torch.Tensor(dgm2)
I1.requires_grad_()

loss = SamplesLoss(loss='sinkhorn', debias=False, p=1, blur=1e-3, scaling=0.999, backend='auto')
loss(I1, I2)

yields tensor(12.9882, grad_fn=<SelectBackward0>). I don't expect the two results to match perfectly, but a ten-fold difference is too much.

I would greatly appreciate help with either making geomloss yield a result similar to gudhi's, or finding an alternative library that does.


Solution

  • I got comments from the developers on GitHub (see this issue for more information). Here is a short summary:

    • What gudhi calls the Wasserstein distance is not the same as what geomloss calls the Wasserstein distance; this appears to be a terminology discrepancy between the OT (optimal transport) and TDA (topological data analysis) communities.
    • gudhi computes a non-standard variation (documented here) that is specifically tailored to persistence diagrams.
    • There is currently no geomloss implementation of the gudhi-style Wasserstein distance, though one may be added in the future.
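
To make the difference between the two definitions concrete, here is a minimal sketch that computes both quantities exactly for the small example in the question. It is not code from gudhi or geomloss. The helper names diagram_wasserstein and balanced_ot are hypothetical, and SciPy is an extra dependency not used in the original post. The sketch rests on two assumptions: that gudhi, with its default settings, uses the L-infinity ground metric and allows any point to be matched to the diagonal at cost (death - birth) / 2, and that geomloss treats each point set as a uniform probability measure with Euclidean cost when p=1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, linprog


def diagram_wasserstein(dgm1, dgm2, order=1.0):
    """Persistence-diagram matching distance (hypothetical helper, not gudhi's code).

    Assumption: L-infinity ground metric, and any point may be matched
    to the diagonal y = x at cost (death - birth) / 2.
    """
    X = np.asarray(dgm1, dtype=float).reshape(-1, 2)
    Y = np.asarray(dgm2, dtype=float).reshape(-1, 2)
    n, m = len(X), len(Y)
    # Rows: n points of X plus m diagonal slots; columns: m points of Y plus n slots.
    cost = np.zeros((n + m, n + m))  # bottom-right block stays 0 (diagonal to diagonal)
    cost[:n, :m] = np.abs(X[:, None, :] - Y[None, :, :]).max(axis=2) ** order
    cost[:n, m:] = (((X[:, 1] - X[:, 0]) / 2.0) ** order)[:, None]
    cost[n:, :m] = (((Y[:, 1] - Y[:, 0]) / 2.0) ** order)[None, :]
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum() ** (1.0 / order))


def balanced_ot(x, y):
    """Exact optimal transport between uniform probability measures.

    Assumption: every point of x has mass 1/len(x), every point of y has
    mass 1/len(y), and the cost is the Euclidean distance (p=1).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    c = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2).ravel()
    # Marginal constraints on the transport plan, flattened row-major.
    a_rows = np.kron(np.eye(n), np.ones((1, m)))  # row i of the plan sums to 1/n
    a_cols = np.kron(np.ones((1, n)), np.eye(m))  # column j of the plan sums to 1/m
    a_eq = np.vstack([a_rows, a_cols])
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(c, A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return float(res.fun)


dgm1 = np.array([[2.7, 3.7], [9.6, 14.0], [34.2, 34.974]])
dgm2 = np.array([[2.8, 4.45], [9.5, 14.1]])

print(diagram_wasserstein(dgm1, dgm2))  # about 1.237, in line with the gudhi result
print(balanced_ot(dgm1, dgm2))          # about 12.988, close to the geomloss result
```

Under these assumptions, the gap comes from the third point of dgm1. The diagram distance sends it to the nearby diagonal for 0.387, whereas balanced transport has to split its mass between the two distant points of dgm2. The assignment-based version is only meant to illustrate the definition; it is not tuned for speed on large diagrams.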