I am trying to compute the KL divergence to measure the overlap between two density functions (2D histograms).
Below is the code I currently have, but the output is an array of numbers rather than a single value. How do I get one number for the divergence?
import matplotlib.pyplot as plt
import random
import scipy.stats

# Two samples of 2D points drawn uniformly from [1, 100)
A_x = [random.randrange(1, 100, 1) for _ in range(10000)]
A_y = [random.randrange(1, 100, 1) for _ in range(10000)]
B_x = [random.randrange(1, 100, 1) for _ in range(100000)]
B_y = [random.randrange(1, 100, 1) for _ in range(100000)]

fig, ax = plt.subplots()
ax.grid(False)

# hist2d returns (counts, xedges, yedges, image); a and b are 100x100 count arrays
a, x, y, p = plt.hist2d(A_x, A_y, bins=100)
b, x, y, p = plt.hist2d(B_x, B_y, bins=100)

# This gives me an array of 100 values instead of a single number
div = scipy.stats.entropy(a, qk=b, base=None)
scipy.stats.entropy assumes that the distributions are 1-dimensional. Looking at the docstring, you can see:
S = -sum(pk * log(pk), axis=0)
which means it sums over the first axis. Giving it arrays of shape (m, n) will give you a result of shape (n,), which amounts to treating each column of your arrays as a separate pair of distributions.
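You can see this directly (a minimal sketch, with random arrays standing in for your histogram counts):

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
pk = rng.random((100, 100))  # stand-in for the 100x100 counts in a
qk = rng.random((100, 100))  # stand-in for the 100x100 counts in b

# Summing over axis 0 treats each of the 100 columns as its own distribution,
# so you get back 100 KL values instead of one.
div = scipy.stats.entropy(pk, qk=qk)
print(div.shape)  # (100,)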
But the definition of entropy doesn't care about the dimensionality of the distributions. It's just about the probabilities of events, which in your case are the individual elements of a and b. So you can do:
div = scipy.stats.entropy(a.ravel(), qk=b.ravel(), base=None)
and you'll get a single value for the KL divergence.
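For reference, here's a minimal end-to-end sketch of the same idea. It uses np.histogram2d rather than plt.hist2d (just to get the counts without drawing a figure), but calling .ravel() on the count arrays returned by plt.hist2d works the same way:

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Two samples of 2D points, like A and B in the question
A_x, A_y = rng.integers(1, 100, size=(2, 10000))
B_x, B_y = rng.integers(1, 100, size=(2, 100000))

# Use the same bin edges for both histograms so the bins line up
edges = np.linspace(1, 100, 101)
a, _, _ = np.histogram2d(A_x, A_y, bins=[edges, edges])
b, _, _ = np.histogram2d(B_x, B_y, bins=[edges, edges])

# Flatten so each bin is one event; entropy() normalizes the counts itself.
# Note: any bin where b is 0 but a is not makes the KL divergence infinite,
# so with sparse histograms you may want coarser bins or more samples.
div = scipy.stats.entropy(a.ravel(), qk=b.ravel())
print(div)  # a single scalar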