python scipy probability-distribution

About the inputs of the Wasserstein Distance W1


NOTE: I asked the same question at https://math.stackexchange.com/questions/4765025/about-the-inputs-of-the-wasserstein-distance-w-1. Since it received no comments or answers, I am also posting it here, because it overlaps with Stack Overflow topics.

In math, you calculate the Wasserstein distance W1 between two probability measures P and Q by using their CDFs, F and G (or their inverse CDFs, F^{-1} and G^{-1}).
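For reference, the standard one-dimensional identities behind this statement can be written as below. This assumes P and Q live on the real line and have finite first moments; u is only an integration variable over quantile levels, not a symbol from the question.

```latex
W_1(P, Q)
  = \int_{-\infty}^{\infty} \lvert F(x) - G(x) \rvert \, dx
  = \int_{0}^{1} \lvert F^{-1}(u) - G^{-1}(u) \rvert \, du
```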

In Python (see scipy.stats.wasserstein_distance), the inputs used to calculate the Wasserstein distance W1 are described as the "Values observed in the (empirical) distributions". Therefore:

  1. What are the "Values observed in the (empirical) distributions" that the SciPy documentation lists as inputs for calculating W1? Do they refer to empirical estimates of the probability density functions, i.e. histograms, or to the empirical cumulative distribution functions (eCDFs)?
  2. How are the inputs used in Python related to the two probability measures P and Q?

Solution

  • I am not sure how SciPy computes the Wasserstein distance; I used the Python OT (POT, Python Optimal Transport) package to compute it from samples.

    Internally, I believe, OT computes the empirical CDFs from the samples and then obtains the distance as an integral.

    Sample code (Python 3.10, Windows x64):

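    import numpy as np
    import ot  # POT: Python Optimal Transport

    # Two samples of 1000 draws each, from N(2, 1) and N(0, 1)
    tab1 = np.random.normal(2, 1, 1000)
    tab2 = np.random.normal(0, 1, 1000)

    # 1-D Wasserstein distance computed directly from the raw samples
    q = ot.wasserstein_1d(tab1, tab2)
    print(q)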

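    This prints a value around 2, as expected: the two distributions differ only by a shift of 2 in the mean, so their W1 distance is 2.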
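The "empirical CDF, then integral" idea can also be checked by hand with NumPy alone. The following is a minimal sketch, not the code that OT or SciPy actually runs; the helper name `w1_from_ecdfs` is made up for illustration. It builds the two eCDFs from raw samples and integrates |F_n(x) - G_m(x)| exactly, since both are step functions.

```python
import numpy as np

def w1_from_ecdfs(u_values, v_values):
    """W1 between two 1-D samples as the integral of |F_n(x) - G_m(x)| dx.

    Hypothetical helper for illustration only.
    """
    u = np.sort(np.asarray(u_values, dtype=float))
    v = np.sort(np.asarray(v_values, dtype=float))
    # Both eCDFs only change at observed values, so the integral is a
    # finite sum over the gaps between consecutive pooled observations.
    grid = np.sort(np.concatenate([u, v]))
    gaps = np.diff(grid)
    # eCDF on [grid[i], grid[i+1]): fraction of samples <= grid[i]
    F = np.searchsorted(u, grid[:-1], side="right") / u.size
    G = np.searchsorted(v, grid[:-1], side="right") / v.size
    return float(np.sum(np.abs(F - G) * gaps))

rng = np.random.default_rng(0)
tab1 = rng.normal(2, 1, 1000)
tab2 = rng.normal(0, 1, 1000)
w1 = w1_from_ecdfs(tab1, tab2)
print(w1)
```

The inputs here are the raw observations themselves. The eCDFs are derived from them inside the function, which is how the samples relate to P and Q: each sample stands in for its measure through its empirical CDF.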

    Link to the package page: https://pythonot.github.io/
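As for the SciPy function named in the question, its `u_values` and `v_values` arguments are the observations themselves (points on the x-axis), not histogram heights or eCDF values. Histogram-style data can still be passed by giving the bin centers as values and the counts as `u_weights` / `v_weights`. A small sketch, where the bin range and bin count are arbitrary choices made for this example:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
sample_p = rng.normal(2, 1, 1000)  # draws from P
sample_q = rng.normal(0, 1, 1000)  # draws from Q

# Raw observations go in directly; each one gets equal weight by default.
d_raw = wasserstein_distance(sample_p, sample_q)

# Histogram form: values are bin centers, weights are the bin counts.
edges = np.linspace(-5, 7, 61)  # assumed range, bin width 0.2
centers = 0.5 * (edges[:-1] + edges[1:])
counts_p, _ = np.histogram(sample_p, bins=edges)
counts_q, _ = np.histogram(sample_q, bins=edges)
d_hist = wasserstein_distance(centers, centers,
                              u_weights=counts_p, v_weights=counts_q)

print(d_raw, d_hist)
```

Both numbers should land close to 2. The histogram version differs slightly because binning moves each observation to its bin center.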