python matplotlib sampling

Frequencies of values in histogram


This is my first post, so please bear with me.

Here is the code:

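import matplotlib.pyplot as plt
import numpy as np

sample_size = 100  # number of samples; also used in the title
plt.figure()
ax1 = plt.subplot()
sample = np.random.normal(loc=0.0, scale=1.0, size=sample_size)
ax1.hist(sample, bins=100)
ax1.set_title('n={}'.format(sample_size))
print(len(np.unique(sample)))  # outputs 100 as expected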

My question: if I pass bins=100 and the number of samples is also 100, why doesn't the histogram plot one bar for every single sample, and why does the output plot contain frequencies greater than 1?


Solution

  • With default parameters, all bins get the same width. With 100 bins, each bin is 1/100th of the total width, and the total width runs from the smallest to the largest sample value.

    Because of this choice of boundaries, at least one point ends up in the first bin and at least one in the last bin. For normally distributed samples, most points fall into the central bins, where several samples share a bin, and many of the outer bins stay empty. That is why the plot shows frequencies greater than 1 and fewer than 100 bars.

    Equal-width bins are usually what you want. A histogram is meant to show in which regions there are more samples and in which there are fewer, and whether there is one peak or several. Generally, to convey interesting information about the data, the number of bins should be much smaller than the number of samples.

    Here is a plot to illustrate what's happening. As 100 bins create a very crowded plot, the example uses just 20 samples and 20 bins. With so few samples, the points are spread out a bit more than they would be with more samples. hist returns three values: an array with the contents of each bin, an array with the boundaries between the bins (this has one more element than the number of bins), and a container with the graphical objects (rectangular patches). The boundaries can be used to show their position.

    import matplotlib.pyplot as plt
    import numpy as np
    
    N = 20
    plt.figure()
    ax1 = plt.subplot()
    sample = np.random.normal(loc=0.0, scale=1.0, size=N)
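    # hist returns the bin counts, the bin boundaries and the bar patches
    bin_values, bin_bounds, _ = ax1.hist(sample, bins=N, label='Histogram')
    ax1.set_title(f'{len(np.unique(sample))} samples')
    # draw a short vertical tick below the axis at every bin boundary (NaN breaks the line)
    ax1.plot(np.repeat(bin_bounds, 3), np.tile([0, -1, np.nan], len(bin_bounds)), label='Bin boundaries')
    # mark each sample value with an open circle below the bars
    ax1.scatter(sample, np.full_like(sample, -0.5), facecolor='none', edgecolor='crimson', label='Sample values')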
    ax1.axhline(0, color='black')
    plt.legend()
    plt.show()
    

    explanatory plot

    Here is what 100 samples and 100 bins look like:

    plot with 100 bins
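
If you prefer numbers to a plot, the same effect can be checked without drawing anything. The sketch below uses np.histogram, which as far as I know is what hist relies on for its default equal-width binning. The fixed seed and the variable names are my own choices for illustration, not part of the original question. It also asks NumPy for a bin count via bins='auto', as one possible way to get far fewer bins than samples.

```python
import numpy as np

# Assumption: a fixed seed, used only so that repeated runs give the same numbers.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100)

# 100 equal-width bins spanning the range from min(sample) to max(sample).
counts, edges = np.histogram(sample, bins=100)

bin_width = edges[1] - edges[0]
print(f"bin width: {bin_width:.4f}")
print(f"empty bins: {np.sum(counts == 0)}")
print(f"largest bin count: {counts.max()}")
print(f"total count: {counts.sum()}")

# Let NumPy pick the number of bins from the data instead of forcing 100.
auto_edges = np.histogram_bin_edges(sample, bins='auto')
print(f"bins chosen by 'auto': {len(auto_edges) - 1}")
```

Every empty bin forces some other bin to hold more than one sample, because 100 samples are shared among 100 bins. If you really want one mark per sample, a scatter of the raw values (as in the explanatory plot above) is a better fit than a histogram.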