
Kullback-Leibler divergence to measure the overlap between two probability density functions


I am trying to compute the KL divergence to measure the overlap between two density functions (2D histograms).

Below is the code I currently have, but the output is an array of numbers rather than a single value. Why?

import matplotlib.pyplot as plt
import random
import scipy.stats

A_x = [random.randrange(1, 100) for _ in range(10000)]
A_y = [random.randrange(1, 100) for _ in range(10000)]

B_x = [random.randrange(1, 100) for _ in range(100000)]
B_y = [random.randrange(1, 100) for _ in range(100000)]

fig, ax = plt.subplots()
ax.grid(False)

# hist2d returns (counts, xedges, yedges, image); a and b are 100x100 count arrays
a, x, y, p = plt.hist2d(A_x, A_y, bins=100)
b, x, y, p = plt.hist2d(B_x, B_y, bins=100)

div = scipy.stats.entropy(a, qk=b, base=None)

Solution

  • scipy.stats.entropy assumes that the distributions are 1-dimensional. Looking at the docstring, you can see:

    S = -sum(pk * log(pk), axis=0)
    

    which means it sums over the first axis. Giving it an array of shape (m, n) will give you a result of shape (n,), which amounts to treating each column of your arrays as a separate pair of distributions.

    But the definition of entropy doesn't care about the dimensionality of the distributions. It's just about the probabilities of an event, which in your case is a single element of a or b. So you can do:

    div = scipy.stats.entropy(a.ravel(), qk=b.ravel(), base=None)
    

    and you'll get a single value for the KL divergence.
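As a sanity check, here is a minimal self-contained sketch (using synthetic count arrays in place of the `plt.hist2d` output, so the exact numbers are illustrative) showing that the flattened call returns a single scalar and matches the KL divergence computed by hand:

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)

# Stand-ins for the 100x100 count arrays returned by plt.hist2d.
# All counts are >= 1 here so the manual log(p/q) below is well defined.
a = rng.integers(1, 20, size=(100, 100)).astype(float)
b = rng.integers(1, 20, size=(100, 100)).astype(float)

# Flatten to 1D: each bin becomes one "event".
# scipy.stats.entropy normalizes the counts to probabilities internally.
div = entropy(a.ravel(), qk=b.ravel())

# Manual check: normalize to probabilities, then sum p * log(p / q).
p = a.ravel() / a.sum()
q = b.ravel() / b.sum()
manual = np.sum(p * np.log(p / q))

print(div)     # a single scalar, not an array
print(manual)  # agrees with div up to floating-point error
```

Note that KL divergence is only finite when every bin with `qk == 0` also has `pk == 0`; with real histogram data you may need to handle empty bins (e.g. by smoothing) before taking the ratio.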