In the following code, density=True makes np.histogram return the probability density function evaluated at each bin. If I now have to calculate P(x), can I say that hist gives probabilities? For example, if the first bin's mean value is 0.5, can I say that the probability at x=0.5 is hist[0]? I have to use KL divergence, which uses P(x).
import numpy as np

x = np.array([0, 0, 0, 0, 0, 3, 3, 2, 2, 2, 1, 1, 1, 1])
hist, bin_edges = np.histogram(x, bins=10, density=True)
When you set density=True, NumPy returns a probability density function (let's say p). Strictly speaking, the probability at any single point is zero, because probability is defined as the area under the PDF curve, and the area above a single point vanishes. You can read more details about it here. So hist[0] is a density, not the probability at x=0.5. If you want to compute a probability, you have to define the desired range and integrate the PDF over it, i.e., multiply each bin's density by its bin width and sum over the bins in that range.
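For example, a minimal sketch using the hist and bin_edges from the question above (the range [a, b] is just an illustrative choice):

import numpy as np

x = np.array([0, 0, 0, 0, 0, 3, 3, 2, 2, 2, 1, 1, 1, 1])
hist, bin_edges = np.histogram(x, bins=10, density=True)

# With density=True the densities integrate to 1 over all bins.
widths = np.diff(bin_edges)
print(np.sum(hist * widths))  # 1.0

# P(a <= x <= b): sum density * width over the bins lying inside [a, b].
a, b = 0.0, 1.0
inside = (bin_edges[:-1] >= a) & (bin_edges[1:] <= b)
print(np.sum(hist[inside] * widths[inside]))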
For the KL divergence, I can share my solution for mutual information (which is essentially a KL divergence between the joint distribution and the product of the marginals):
import numpy as np
from scipy import ndimage

EPS = np.finfo(float).eps  # prevents log(0) and division by 0

def mutual_information(x, y, sigma=1):
    bins = (256, 256)
    # joint histogram
    hist_xy = np.histogram2d(x, y, bins=bins)[0]
    # smooth it out for better results
    ndimage.gaussian_filter(hist_xy, sigma=sigma, mode='constant', output=hist_xy)
    # normalize to a joint probability distribution
    hist_xy = hist_xy + EPS  # prevent division by 0
    hist_xy = hist_xy / np.sum(hist_xy)
    # compute marginals
    hist_x = np.sum(hist_xy, axis=0)
    hist_y = np.sum(hist_xy, axis=1)
    # compute MI: H(X) + H(Y) - H(X, Y)
    mi = (np.sum(hist_xy * np.log(hist_xy))
          - np.sum(hist_x * np.log(hist_x))
          - np.sum(hist_y * np.log(hist_y)))
    return mi
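As a quick sanity check (a hypothetical usage sketch, assuming the definitions above; the data is just random):

rng = np.random.default_rng(0)
a = rng.normal(size=10_000)
b = a + rng.normal(scale=0.5, size=10_000)  # dependent on a
c = rng.normal(size=10_000)                 # independent of a

print(mutual_information(a, b))  # noticeably larger for the dependent pair
print(mutual_information(a, c))  # much smaller for the independent pair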
EDIT: KL could be computed like this (please note that I did not test this!):
def kl(x, y, sigma=1):
    bins = (256, 256)
    # joint histogram
    hist_xy = np.histogram2d(x, y, bins=bins)[0]
    # smooth it out for better results
    ndimage.gaussian_filter(hist_xy, sigma=sigma, mode='constant', output=hist_xy)
    # normalize to a joint probability distribution
    hist_xy = hist_xy + EPS  # prevent division by 0
    hist_xy = hist_xy / np.sum(hist_xy)
    # compute marginals
    hist_x = np.sum(hist_xy, axis=0)
    hist_y = np.sum(hist_xy, axis=1)
    # KL(P || Q) = sum P * log(P / Q) between the two marginals
    kl = -np.sum(hist_x * np.log(hist_y / hist_x))
    return kl
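Since the question is about a 1-D P(x), a simpler variant of the same idea (my own sketch, equally untested; kl_1d is a name I made up) is to histogram both samples over shared bin edges and apply the definition KL(P || Q) = sum P * log(P / Q) directly:

def kl_1d(x, y, bins=50):
    # shared edges so P and Q are defined over the same support
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    p = np.histogram(x, bins=edges)[0] + EPS  # prevent log(0)
    q = np.histogram(y, bins=edges)[0] + EPS
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log(p / q))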
Also, for the best results, you should choose sigma with some heuristic, for example a rule-of-thumb bandwidth estimator.
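For example, Silverman's rule of thumb gives a bandwidth in data units (converting it to the filter's bin units by dividing by the bin width is my assumption; adapt it to your binning):

def rule_of_thumb_bandwidth(x):
    # Silverman: 0.9 * min(std, IQR / 1.34) * n^(-1/5), in data units
    n = len(x)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(np.std(x, ddof=1), iqr / 1.34) * n ** (-1 / 5)

# e.g. a sigma (in bins) for the histogram smoothing above:
# sigma = rule_of_thumb_bandwidth(x) / bin_width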