python numpy matplotlib statistics probability-density

How to plot an histogram correctly with numpy, and match it with the density function?

TL;DR: How to plot the result of np.histogram(..., density=True) correctly with Numpy?

Using density=True should help to match the histogram of the sample, and the density function of the underlying random variable, but it doesn't:

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
y = np.random.randn(10000)
h, bins = np.histogram(y, bins=1000, density=True)
plt.bar(bins[:-1], h)
x = np.linspace(-10, 10, 100)
f = scipy.stats.norm.pdf(x)
plt.plot(x, f, color="green")
plt.show()

Why aren't the histogram and probability density functions scaled accordingly?

In this case, an observation shows that a 1.6 scaling would be better:

plt.plot(x, 1.6 * f, color="green")

Also, this works normally:

plt.hist(y, bins=100, density=True)

Why?

Solution

TL;DR:

The default bar width of .bar() is too big (.hist() adjusts width internally).
The number of bars is too high for the default figsize (that's why 100 bins is OK but 1000 is not).

In Axes.hist, bar widths are computed by np.diff(bins) (source code). Since it allows a multi-dimensional array, there are a whole lot of validation and reshaping done behind the scenes but if we set all that aside, for a 1D array, Axes.hist is just a wrapper around np.histogram and Axes.bar whose (abridged) code looks like the following:

height, bins = np.histogram(x, bins)
width = np.diff(bins)
boffset = 0.5 * width
Axes.bar(bins[:-1]+boffset, height, width)

On the other hand, Axes.bar iteratively adds matplotlib.patches.Rectangle objects to an Axes using a default width of 0.8, (source implementation), so if a bar is particularly tall and subsequent bars are short, the short ones will be hidden behind the tall ones.

The following code (kind of) illustrates the points made above. The histograms are the same; for example the tallest bars are the same. Note that, in the figure below the figsize is 12"x5" and each Axes is roughly 3" wide), so given that the default dpi is 100, it can only show roughly 300 dots horizontally which means it cannot show all 1000 bars correctly. We need an appropriately wide figure to show all bars correctly.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# sample
y = np.random.default_rng(0).standard_normal(10000)
N = 1000
h, bins = np.histogram(y, bins=N, density=True)
x = np.linspace(-5, 5, 100)
f = stats.norm.pdf(x)

# figure
fig, axs = plt.subplots(1, 3, figsize=(12,5))
a0 = axs[0].bar(bins[:-1], h)
a1 = axs[1].bar(bins[:-1]+0.5*np.diff(bins), h, np.diff(bins))
h_hist, bins_hist, a2 = axs[2].hist(y, bins=N, density=True)
for a, t in zip(axs, ['Axes.bar with default width', 'Axes.bar with width\nrelative to bins', 'Axes.hist']):
    a.plot(x, f, color="green")
    a.set_title(t)

# label tallest bar of each Axes
axs[0].bar_label(a0, [f'{h.max():.2f}' if b.get_height() == h.max() else '' for b in a0], fontsize=7)
axs[1].bar_label(a1, [f'{h.max():.2f}' if b.get_height() == h.max() else '' for b in a1], fontsize=7)
axs[2].bar_label(a2, [f'{h_hist.max():.2f}' if b.get_height() == h_hist.max() else '' for b in a2], fontsize=7)

# the bin edges and heights from np.histogram and Axes.hist are the same
(h == h_hist).all() and (bins == bins_hist).all()     # True

For example, if we plot the same figure with figsize=(60,5) (everything else the same), we get the following figure where the bars are correctly shown (in particular Axes.hist and Axes.bar with adjusted widths are the same).