Search code examples
histogramdistancedistributionloss-functionmultivariate-testing

Calculate total distance between multiple pairwise distributions/histograms


I am not sure about the terminology I should use for my problem, so I will give an example.

I have 2 sets of measurements (6 empirical distributions per set = D1-6) that describe 2 different states of the same system (BLUE & RED). These distributions can be multimodal, skewed, undersampled, and strange in some other unpredictable ways.

BLUE is my reference and I want to make RED distributed as closely as possible to BLUE, for all pairwise distributions. For this, I will play with parameters of my RED system and monitor the RED set of measurements D1-6 trying to make it overlap BLUE perfectly.

I know that I can use Jensen-Shannon or Bhattacharyya distances to evaluate the distance between 2 distributions (i.e. RED-D1 and BLUE-D1, for example). However, I do not know if there exist other metrics that could be applied here to get a global distance between all distributions (i.e. quantify the global mismatch between 2 sets of pairwise distributions). Is that the case ?

I am thinking about building an empirical scoring function that would use all the pairwise Jensen-Shannon distances, but I have no better ideas yet. I believe I can NOT just sum all the JS distances because I would get similar scores in these 2 hypothetical, different cases:

  1. D1-6 are distributed as in my image

  2. RED-D1-5 are a much better fit to BLUE-D1-5, BUT RED-D6 is shifted compared to BLUE-D6

And that would be wrong because I would have missed one important feature of my system. Given these 2 cases, it is better to have D1-6 distributed as in my image (solution 1).

The pairwise match between each distribution is equally important and should be equally weighted (i.e. the match between BLUE-D1 and RED-D1 is as important as the match between BLUE-D2 and RED-D2, etc).

D1-3 has a given range DOM1 of [0, 5] and D4-6 has another range DOM2 of [50, 800]. Diamonds represent the weighted means of BLUE and RED distributions.

enter image description here

Thank you very much for your help!


Solution

  • I ended up using a sum of all pairwise Earth Mover's Distance (EMD, https://en.wikipedia.org/wiki/Earth_mover%27s_distance, also known as Wasserstein metric) as a global metric of distance between all pairwise distributions. This describes the difference or similarity between 2 states of my system in an appropriate way.

    EMD is implemented in python in package 'pyemd' or using scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html.