I am trying to make a very simple histogram with matplotlib.pyplot.hist, and it does not seem to count the number of values in each bin correctly. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
plt.hist([.2,.3,.5,.6],bins=np.arange(0,1.1,.1))
I am dividing the interval [0,1] into bins of width .1, so I should get four bars of height 1. But the output figure consists of only two bars of height 2: it is counting the .3 value as part of the [.2,.3) bin and, similarly, the .6 value as part of the [.5,.6) bin. I have tried it in both Spyder and Google Colab. Does anyone know what's going on? Thanks!
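The problem is that the values fall exactly on the bin boundaries, and those boundaries are not exact. Floating-point rounding can push an edge slightly above or below the intended value, so a data point can land in either the previous or the next bin. With np.arange(0, 1.1, .1), for example, the edge meant to be 0.3 is actually 0.30000000000000004, so the value 0.3 is counted in the bin that starts at 0.2. You need bin boundaries that lie clearly in between the data points. Note that matplotlib's histogram is primarily meant for continuous distributions, where floating-point rounding doesn't have such large effects.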
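To see the rounding directly, here is a minimal sketch that skips the plot and inspects the bin edges with np.histogram, which plt.hist uses under the hood. The variable names are only for illustration.

```python
import numpy as np

data = [.2, .3, .5, .6]
edges = np.arange(0, 1.1, .1)

# repr() shows the full stored value of each edge instead of a rounded display.
print([repr(float(e)) for e in edges])

# The edges meant to be 0.3 and 0.6 end up slightly larger than the data values,
# so 0.3 and 0.6 fall into the bins that start at 0.2 and 0.5.
print(0.3 < edges[3], 0.6 < edges[6])

counts, _ = np.histogram(data, bins=edges)
print(counts)
```

The counts show two bins holding two values each, matching the two bars of height 2 in the question.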
Here is some code to illustrate what's happening in both situations:
import numpy as np
import matplotlib.pyplot as plt
data = [.2, .3, .5, .6]
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
for ax in axes:
    if ax == axes[0]:
        # Edges coincide with the data values (up to floating-point rounding).
        bins = np.arange(0, 1.1, .1)
        ax.set_title('data on bin boundaries')
    else:
        # Edges shifted by half a bin width, so each value sits mid-bin.
        bins = np.arange(-0.05, 1.1, .1)
        ax.set_title('data between bin boundaries')
    values, bin_bounds, bars = ax.hist(data, bins=bins, alpha=0.3)
    # Mark the bin edges and the data points on top of the bars.
    ax.vlines(bin_bounds, 0, max(values), color='crimson', ls=':')
    ax.scatter(data, np.full_like(data, 0.5), color='lime', s=30)
    ax.set_ylim(0, 2.2)
    ax.set_yticks(range(3))
plt.show()
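If the data always sits on a regular grid, the shifted edges can be generated rather than typed by hand. The following is only a sketch under that assumption: centered_edges is a hypothetical helper, not part of numpy or matplotlib, and it uses np.linspace so that the number of edges and both endpoints are set explicitly.

```python
import numpy as np

def centered_edges(first_center, last_center, step):
    """Hypothetical helper: put one bin around each grid point, edges half a step away."""
    n_bins = int(round((last_center - first_center) / step)) + 1
    return np.linspace(first_center - step / 2, last_center + step / 2, n_bins + 1)

data = [.2, .3, .5, .6]
centered = centered_edges(0.0, 1.0, 0.1)

# Each data value now lies in the middle of its own bin, far from any edge.
centered_counts, _ = np.histogram(data, bins=centered)
print(centered_counts)
```

The resulting array can be passed as bins= to plt.hist in the same way as the np.arange(-0.05, 1.1, .1) edges above.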