python numpy matplotlib histogram

Matplotlib histogram misplaced and missing bars


I have large data files, so I use numpy's histogram function (the same one matplotlib uses) to build histograms manually and update them batch by batch. However, when I plot the result, the bars appear shifted.

This is the code I use to manually create and update histograms in batches. Note that all histograms share the same bins.

temp = np.histogram(batch, bins=np.linspace(0, 40, 41))
hist += temp[0]

The code above is repeated as I parse the data files. For example, a small data set would have the following as the final histogram data:

[8190, 666, 278, 145, 113, 83, 52, 48, 45, 44, 45, 29, 28, 45, 29, 15, 16, 10, 17, 7, 15, 6, 10, 7, 3, 5, 7, 4, 2, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 29]
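
The batch update described above can be sketched roughly as follows. This is only a minimal illustration: the function name `accumulate_histogram` and the tiny `batches` list are hypothetical stand-ins for the real file-parsing loop, which is not shown in the question.

```python
import numpy as np

def accumulate_histogram(batches, edges):
    """Sum np.histogram counts over batches that share the same bin edges."""
    # One running count per bin; there are len(edges) - 1 bins
    hist = np.zeros(len(edges) - 1, dtype=np.int64)
    for batch in batches:
        counts, _ = np.histogram(batch, bins=edges)
        hist += counts
    return hist

edges = np.linspace(0, 40, 41)
# Hypothetical stand-in for data read from the files in chunks
batches = [np.array([0.5, 1.5, 1.7]), np.array([0.2, 39.9, 40.0])]
hist = accumulate_histogram(batches, edges)
print(hist[:3], hist[-1])
```

Because every call uses the same edges, the per-batch counts can simply be added element-wise.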

Below is the plotting code.

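import matplotlib
matplotlib.use('agg')  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
plt.xticks(np.linspace(0, 1, 11))
# scores presumably holds the 40 accumulated bin counts shown above
plt.hist([i/40 for i in range(40)], bins=np.linspace(0, 1, 41), weights=scores, rwidth=0.7)
# 'nonposy' was renamed to 'nonpositive' in Matplotlib 3.3
plt.yscale('log', nonpositive='clip')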

The resulting figure is quite strange. It shows no bar at [0.475, 0.5), and I expect the last bin, which covers [0.975, 1.0], to hold the final count of 29. Instead, I see that bar at the [0.950, 0.975) position. I thought this might be related to how I use bins and linspace, but the decoy array and the weights have the same size.


I've never seen this kind of behavior. I also wondered whether it is caused by the ranges being half-open, [x, x+width), but I haven't had issues with that before.

A note on using linspace: it specifies bin edges, so 40 bins require 41 edges.

In [2]: np.linspace(0,1,41)                                                     
Out[2]: 
array([0.   , 0.025, 0.05 , 0.075, 0.1  , 0.125, 0.15 , 0.175, 0.2  ,
       0.225, 0.25 , 0.275, 0.3  , 0.325, 0.35 , 0.375, 0.4  , 0.425,
       0.45 , 0.475, 0.5  , 0.525, 0.55 , 0.575, 0.6  , 0.625, 0.65 ,
       0.675, 0.7  , 0.725, 0.75 , 0.775, 0.8  , 0.825, 0.85 , 0.875,
       0.9  , 0.925, 0.95 , 0.975, 1.   ])

In [3]: len(np.linspace(0,1,41))                                                
Out[3]: 41
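
The edge/bin relationship can also be checked directly with np.histogram. The snippet below is a small sketch with made-up sample values. It also shows that, per the numpy documentation, only the last bin is closed on the right, so a value of exactly 1.0 is counted in [0.975, 1.0].

```python
import numpy as np

edges = np.linspace(0, 1, 41)

# Made-up sample values, including one exactly on the right-most edge
counts, returned_edges = np.histogram([0.1, 0.5, 1.0], bins=edges)

n_bins = len(counts)            # number of bins
n_edges = len(returned_edges)   # number of edges, one more than bins
print(n_bins, n_edges, counts[-1])
```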

Solution

  • It seems you're using plt.hist to put exactly one value into each bin, effectively simulating a bar plot. Because the x-values fall exactly on the bin bounds, floating-point rounding can push some of them into the neighboring bin. That can be mitigated by shifting the x-values by half a bin width. The simplest approach is to draw the bars directly.

    The following code creates a bar plot with the given data, with each bar at the center of the region it represents. As a check, the bars are measured again at the end and their heights are displayed.

    from matplotlib.ticker import MultipleLocator
    import matplotlib.pyplot as plt
    import numpy as np
    
    # Accumulated counts, one per bin (40 bins)
    scores = [8190, 666, 278, 145, 113, 83, 52, 48, 45, 44, 45, 29, 28, 45, 29, 15, 16, 10, 17, 7, 15, 6, 10, 7, 3, 5, 7, 4, 2, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 29]
    binbounds = np.linspace(0, 1, 41)  # 41 edges -> 40 bins
    rwidth = 0.7  # bar width as a fraction of the bin width
    width = binbounds[1] - binbounds[0]
    # Place each bar at the center of its bin
    bars = plt.bar(binbounds[:-1] + width / 2, height=scores, width=width * rwidth, align='center')
    plt.gca().xaxis.set_major_locator(MultipleLocator(0.1))
    plt.gca().xaxis.set_minor_locator(MultipleLocator(0.05))
    # 'nonposy' was renamed to 'nonpositive' in Matplotlib 3.3
    plt.yscale('log', nonpositive='clip')
    # Label each bar with its measured height as a check
    for rect in bars:
        x, y = rect.get_xy()
        w = rect.get_width()
        h = rect.get_height()
        plt.text(x + w / 2, h, f'{h}\n', ha='center', va='center')
    plt.show()
    

    resulting plot
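
If the bins were not all the same width, the bar centers and widths could be derived from the edges instead of from a single `width` value. The following is a sketch of that variation, not part of the original answer; the names `centers` and `widths` are chosen here for illustration.

```python
import numpy as np

edges = np.linspace(0, 1, 41)

# Midpoint of each pair of neighboring edges
centers = (edges[:-1] + edges[1:]) / 2
# Width of each bin; works for non-uniform edges as well
widths = np.diff(edges)

print(centers[:3], widths[:3])
```

These arrays could then be passed to plt.bar as the x positions and (scaled by rwidth) the bar widths.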

    PS: To see what's happening with the original histogram, just do a test plot without the weights:

    plt.hist([i/40 for i in range(40)], bins=np.linspace(0, 1, 41), rwidth=1, ec='k')
    plt.plot([i/40 for i in range(40)], [0.5] * 40, 'ro')
    plt.xticks(np.linspace(0, 1, 11))
    

    A red dot shows where the x-values are. Some fall into the correct bin, while others fall into the neighboring bin, which then holds 2 values.

    histogram without weights

    To create a histogram with the x-values at the center of each bin:

    plt.hist([i/40 + 1/80 for i in range(40)], bins=np.linspace(0, 1, 41), rwidth=1, ec='k')
    plt.plot([i/40 + 1/80 for i in range(40)], [0.5] * 40, 'ro')
    plt.xticks(np.linspace(0, 1, 11))
    plt.yticks([0, 1])
    

    x-values in center of bin
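
The same comparison can be made numerically, without plotting. This is a minimal sketch using np.histogram (which plt.hist relies on for binning). Which bins end up empty or doubled depends on floating-point rounding, so the snippet prints them rather than assuming specific indices.

```python
import numpy as np

edges = np.linspace(0, 1, 41)
on_edge = np.array([i / 40 for i in range(40)])  # values placed on the bin bounds
centered = on_edge + 1 / 80                      # shifted by half a bin width

counts_on_edge, _ = np.histogram(on_edge, bins=edges)
counts_centered, _ = np.histogram(centered, bins=edges)

# Bins that did not receive exactly one value (empty or doubled)
misplaced = np.flatnonzero(counts_on_edge != 1)
print("bins without exactly one value:", misplaced)
print("centered values give one count per bin:", bool(np.all(counts_centered == 1)))
```

With the shifted values, every x-value lies well inside its bin, so rounding at the edges no longer matters.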