Tags: python, matplotlib, histogram, binning

Choice of bins for histograms with relatively few datapoints


Consider a plot with multiple histograms in matplotlib like this:

#! /usr/bin/env python3
import matplotlib.pyplot as plt
import random

# Use the same seed for reproducibility.
random.seed(10586)

data1 = [random.gauss(1e-4, 3e-2) for _ in range(10**3)] + [0.3]
data2 = [random.gauss(1e-2, 3e-3) for _ in range(10**3)] + [0.4]
data3 = [0.2]

if __name__ == '__main__':
    plt.xlim(xmin=0, xmax=0.8)
    plt.yscale('log')
    n1, bins1, patches1 = plt.hist(data1, bins='auto', alpha=0.6)
    n2, bins2, patches2 = plt.hist(data2, bins='auto', alpha=0.6)
    n3, bins3, patches3 = plt.hist(data3, bins='auto', alpha=0.6)
    bin_options = ['auto', 'fd', 'doane', 'scott', 'rice', 'sturges', 'sqrt']
    plt.show()

However, the third dataset has only one datapoint, so when we use plt.hist(data3, bins='auto') we get a long bar stretched across the x-range, and can no longer see that its value is 0.2:

[Figure: stretched out]

(This is most apparent with just one datapoint, but it's an issue with e.g. two or three also.)

One way to avoid this is to re-use the bins of another dataset. For example, with plt.hist(data3, bins=bins1), we can see data3 just fine:

[Figure: what we want]

However, if we use the other data set via bins=bins2, the bins are too narrow and we cannot see data3 at all:

[Figure: all gone]
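To see why the three cases differ, it can help to look at the bin widths directly. The following is a minimal sketch, assuming NumPy 1.15 or later (for np.histogram_bin_edges); the helper name bin_summary is made up for this illustration. The edges it reports should match what plt.hist computes with bins='auto', since plt.hist delegates binning to NumPy.

```python
import random

import numpy as np

# Same seed and data as in the question.
random.seed(10586)
data1 = [random.gauss(1e-4, 3e-2) for _ in range(10**3)] + [0.3]
data2 = [random.gauss(1e-2, 3e-3) for _ in range(10**3)] + [0.4]
data3 = [0.2]


def bin_summary(data, rule='auto'):
    """Return (number of bins, width of one bin) for the given binning rule.

    Hypothetical helper, not part of matplotlib or NumPy.
    """
    edges = np.histogram_bin_edges(data, bins=rule)
    return len(edges) - 1, float(edges[1] - edges[0])


count1, width1 = bin_summary(data1)
count2, width2 = bin_summary(data2)
count3, width3 = bin_summary(data3)

for name, count, width in [('data1', count1, width1),
                           ('data2', count2, width2),
                           ('data3', count3, width3)]:
    print(f"{name}: {count} bins, each about {width:.4g} wide")
```

With the x-axis spanning 0 to 0.8, any bin much narrower than 0.8 divided by the axis width in pixels is drawn thinner than one pixel, which is why a bar built from bins2 can vanish. A single datapoint has no spread to estimate a width from, so NumPy falls back to one wide bin around the value, which is the stretched bar.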

How can we ensure that a histogram with relatively few points is visible, but still see its value on the x-axis?


Solution

  • To ensure you see a bar even when it is narrower than a single pixel, you could give it an edgecolor:

    import matplotlib.pyplot as plt
    import random
    random.seed(10586)
    
    data2 = [random.gauss(1e-2, 3e-3) for _ in range(10**3)] + [0.4]
    
    plt.xlim(0, 0.8)
    plt.yscale('log')
    
    n2, bins2, patches2 = plt.hist(data2, bins='auto', alpha=0.6, edgecolor="C0")
    
    plt.show()
    

    Or use histtype="stepfilled" to create a polygon, because individual bars aren't distinguishable with that many bins anyway:

    n2, bins2, patches2 = plt.hist(data2, bins='auto', alpha=0.6, histtype="stepfilled")
    

    The latter also has the advantage of obeying the alpha, which is otherwise not visible because the bars overlap. It should also be faster to draw, since it is one single shape rather than some 1000 bars.
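    As a rough sketch of how these suggestions might be combined for the three datasets from the question: the two large datasets are drawn as stepfilled polygons, while the single-point dataset borrows the wider bins of data1 (the workaround mentioned in the question) and gets an edgecolor. The function name plot_all and the choice of 'C2' for the edge are assumptions made for this example, not part of the original answer.

```python
import random

import matplotlib.pyplot as plt

# Same seed and data as in the question.
random.seed(10586)
data1 = [random.gauss(1e-4, 3e-2) for _ in range(10**3)] + [0.3]
data2 = [random.gauss(1e-2, 3e-3) for _ in range(10**3)] + [0.4]
data3 = [0.2]


def plot_all(ax):
    """Draw all three datasets on ax (hypothetical helper for illustration)."""
    ax.set_xlim(0, 0.8)
    ax.set_yscale('log')
    # Large datasets: one filled polygon each, so sub-pixel bins cannot vanish.
    _, bins1, _ = ax.hist(data1, bins='auto', alpha=0.6, histtype='stepfilled')
    _, _, patches2 = ax.hist(data2, bins='auto', alpha=0.6,
                             histtype='stepfilled')
    # Tiny dataset: re-use the wider bins of data1 and outline the bar.
    # 'C2' is the third colour of the default cycle, matching the bar's fill.
    n3, _, _ = ax.hist(data3, bins=bins1, alpha=0.6, edgecolor='C2')
    return bins1, patches2, n3


fig, ax = plt.subplots()
bins1, patches2, n3 = plot_all(ax)
# Call plt.show() here to display the figure interactively.
```

    Re-using bins1 only works here because 0.2 lies inside the range covered by data1; for other data an explicit shared bin array may be the safer choice.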