Tags: python, matplotlib, histogram

Normalizing Histograms


Hi, I am plotting three different histograms that have different total frequencies, and I want to normalize them so that their total frequencies are the same.

[Image: the three histograms overlaid on one plot]

As you can see from the picture, the three sets have different total frequencies. I want to normalize them so that they have the same total frequency, while preserving the proportion of the frequency at each value on the x-axis.

Here's the code I am using to plot the histograms:

setA = [22.972972972972972, 0.0, 0.0, 27.5, 25.0, 18.64406779661017, 8.88888888888889, 20.512820512820515, 11.11111111111111, 15.151515151515152, 17.741935483870968, 13.333333333333334, 16.923076923076923, 12.820512820512821, 27.77777777777778, 4.0, 0.0, 15.625, 14.814814814814815, 7.142857142857143, 15.384615384615385, 14.545454545454545, 38.095238095238095, 17.647058823529413, 21.951219512195124, 21.428571428571427, 32.432432432432435, 10.526315789473685, 36.8421052631579, 13.114754098360656, 17.91044776119403, 12.64367816091954, 16.0, 22.727272727272727, 18.181818181818183, 9.523809523809524, 17.105263157894736, 11.904761904761905, 20.58823529411765, 10.714285714285714, 15.686274509803921, 27.5, 16.129032258064516, 21.333333333333332, 40.90909090909091, 11.904761904761905, 13.157894736842104]
setB = [1.492537313432836, 3.5714285714285716, 17.94871794871795, 11.363636363636363, 13.513513513513514, 14.285714285714286, 15.686274509803921, 17.94871794871795, 9.090909090909092, 41.07142857142857, 10.714285714285714, 25.0, 20.0, 40.0, 13.333333333333334, 13.793103448275861, 3.5714285714285716, 17.073170731707318, 25.675675675675677, 15.625, 17.46031746031746, 8.333333333333334, 18.64406779661017, 14.285714285714286, 0.0, 6.0606060606060606, 6.976744186046512, 18.181818181818183, 26.785714285714285, 22.80701754385965, 6.666666666666667, 12.5]
setC = [13.846153846153847, 23.076923076923077, 25.0, 10.714285714285714, 16.666666666666668, 9.75609756097561, 10.0, 10.0, 17.857142857142858, 20.0, 9.75609756097561, 26.470588235294116, 12.5, 13.333333333333334, 4.3478260869565215, 5.882352941176471, 14.545454545454545, 13.333333333333334, 8.571428571428571, 11.764705882352942, 0.0]

import matplotlib.pyplot as plt

plt.figure('sets')
n, bins, patches = plt.hist(setA, 20, alpha=0.40, label='setA')
n, bins, patches = plt.hist(setB, 20, alpha=0.40, label='setB')
n, bins, patches = plt.hist(setC, 20, alpha=0.40, label='setC')
plt.xlabel('Set')
plt.ylabel('Frequency')
plt.title('Different Sets that need to be normalised')

plt.legend()
plt.grid(True)
plt.show()

As a bonus: since my aim is to compare the distributions of the three sets, is there a better way to draw the histograms so that they are easier to compare graphically?


Solution

  • You can normalise the histograms with the density=True option (called normed=True in older Matplotlib versions). Each histogram is then scaled so that the area under it equals 1.

    You can also make the plot look a bit tidier by using the same fixed bins for all three histograms (pass them through the bins option of hist: bins = np.arange(0,48,2), for example).

    Try this:

    import numpy as np
    import matplotlib.pyplot as plt
    
    ...
    
    # Shared bin edges for all three sets: 0, 2, ..., 46
    mybins = np.arange(0, 48, 2)
    
    # density=True (normed=True in older Matplotlib) scales each histogram to unit area
    plt.hist(setA, bins=mybins, alpha=0.40, label='setA', density=True)
    plt.hist(setB, bins=mybins, alpha=0.40, label='setB', density=True)
    plt.hist(setC, bins=mybins, alpha=0.40, label='setC', density=True)
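
To see what this normalisation does numerically, here is a minimal sketch using np.histogram with the same bins. The two sample lists and the helper name histogram_area are made up for illustration (they are not the data sets from the question). The point is only that the area comes out as 1 regardless of how many values a set contains.

```python
import numpy as np

# Illustrative samples of different sizes (hypothetical, not the question's data)
small = [0.0, 4.3, 8.6, 10.0, 12.5, 13.3]
large = [0.0, 1.5, 3.6, 6.1, 9.1, 11.4, 13.5, 14.3,
         15.7, 17.9, 20.0, 25.0, 26.8, 40.0]

mybins = np.arange(0, 48, 2)  # same bin edges as in the answer: 0, 2, ..., 46


def histogram_area(values, bins):
    """Area under a density-normalised histogram (bar height times bin width)."""
    heights, edges = np.histogram(values, bins=bins, density=True)
    return float(np.sum(heights * np.diff(edges)))


area_small = histogram_area(small, mybins)
area_large = histogram_area(large, mybins)

# Both areas should be 1.0 up to floating-point rounding
print(area_small, area_large)
```

Note that it is the area, not the sum of the bar heights, that equals 1; with bins of width 2 the heights themselves sum to 0.5.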
    


    Another option is to plot all three histograms in one call to plt.hist, in which case you can use the stacked=True option, which can further clean up your plot.

    Note: with stacked=True the normalisation is applied to the stack as a whole, so the combined integral of the three histograms is 1. It does not give each of the three histograms the same total.

    # density=True replaces normed=True from older Matplotlib versions
    plt.hist([setA, setB, setC], bins=mybins,
             label=['setA', 'setB', 'setC'],
             density=True, stacked=True)
    


    Or, finally, if a stacked histogram is not to your taste, you can plot the bars next to each other by again plotting all three histograms in one call, but without the stacked=True option:

    # Without stacked=True, each set is normalised to unit area on its own
    plt.hist([setA, setB, setC], bins=mybins,
             label=['setA', 'setB', 'setC'],
             density=True)
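
If what you are after is literally "the same total frequency" for each set (bar heights that sum to the same number, rather than unit area), one common alternative is to give every sample a weight of 1/N. This is a sketch of that idea, not part of the original answer; the sample lists and the helper name relative_frequencies are hypothetical.

```python
import numpy as np

# Illustrative samples of different sizes (hypothetical, not the question's data)
set_small = [0.0, 4.3, 8.6, 10.0, 12.5, 13.3]
set_large = [0.0, 1.5, 3.6, 6.1, 9.1, 11.4, 13.5, 14.3,
             15.7, 17.9, 20.0, 25.0, 26.8, 40.0]

mybins = np.arange(0, 48, 2)


def relative_frequencies(values, bins):
    """Bin heights as fractions of the sample size.

    The heights sum to 1 provided every value falls inside the bin range.
    """
    values = np.asarray(values, dtype=float)
    weights = np.full(values.shape, 1.0 / values.size)  # each sample counts 1/N
    heights, _ = np.histogram(values, bins=bins, weights=weights)
    return heights


rel_small = relative_frequencies(set_small, mybins)
rel_large = relative_frequencies(set_large, mybins)

# Each set now has the same total, while the shape of its distribution is unchanged
print(rel_small.sum(), rel_large.sum())
```

plt.hist accepts a weights argument as well, so the same per-sample weights can be passed to it directly if you prefer to let Matplotlib do the binning.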
    


    As discussed in the comments, when stacked=True is used, the normalisation option (density=True, formerly normed=True) only makes the combined total of the three histograms equal 1, so the individual sets are not normalised in the same way as in the other methods.

    To counter this, we can use np.histogram, and plot the results using plt.bar.

    For example, using the same data sets from above:

    mybins = np.arange(0, 48, 2)
    
    # density=True replaces normed=True from older NumPy versions
    nA, binsA = np.histogram(setA, bins=mybins, density=True)
    nB, binsB = np.histogram(setB, bins=mybins, density=True)
    nC, binsC = np.histogram(setC, bins=mybins, density=True)
    
    # Each histogram integrates to 1, so divide by 3
    # to make the stacked histogram integrate to 1.
    nA /= 3.
    nB /= 3.
    nC /= 3.
    
    # binsX[:-1] are the left bin edges, so anchor the bars there with align='edge'
    # (plt.bar centres bars on x by default). Use bottom= to set where the bars begin.
    plt.bar(binsA[:-1], nA, width=2, align='edge', color='b', label='setA')
    plt.bar(binsB[:-1], nB, width=2, align='edge', color='g', label='setB', bottom=nA)
    plt.bar(binsC[:-1], nC, width=2, align='edge', color='r', label='setC', bottom=nA+nB)
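
As a quick sanity check on the divide-by-three step, the following sketch confirms that each set contributes an area of 1/3 and that the stack integrates to 1. The three sample lists are hypothetical stand-ins for setA, setB and setC.

```python
import numpy as np

# Illustrative stand-ins for the three sets (hypothetical values)
sample_a = [0.0, 4.0, 8.9, 11.1, 15.2, 17.7, 22.9, 27.5, 38.1]
sample_b = [1.5, 3.6, 9.1, 13.5, 17.9, 25.0, 41.1]
sample_c = [0.0, 5.9, 10.0, 13.3, 23.1]

mybins = np.arange(0, 48, 2)
widths = np.diff(mybins)

# Normalise each set to unit area, then scale by 1/3 as in the answer
parts = [np.histogram(s, bins=mybins, density=True)[0] / 3.0
         for s in (sample_a, sample_b, sample_c)]

part_areas = [float(np.sum(p * widths)) for p in parts]
total_area = sum(part_areas)

# Each part should contribute about 1/3, and the stack about 1
print(part_areas, total_area)
```

Because every set is scaled to the same area before stacking, each one takes up an equal share of the stack no matter how many values it contains.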
    
