Hi, I am plotting three histograms whose data sets have different total frequencies (different numbers of values), and I want to normalize them so that the totals are the same.
As you can see from the picture, the three sets have different total frequencies. I want to rescale them to a common total while preserving the proportion of the frequency at each value on the x-axis.
Here's the code I am using to plot the histograms:
setA = [22.972972972972972, 0.0, 0.0, 27.5, 25.0, 18.64406779661017, 8.88888888888889, 20.512820512820515, 11.11111111111111, 15.151515151515152, 17.741935483870968, 13.333333333333334, 16.923076923076923, 12.820512820512821, 27.77777777777778, 4.0, 0.0, 15.625, 14.814814814814815, 7.142857142857143, 15.384615384615385, 14.545454545454545, 38.095238095238095, 17.647058823529413, 21.951219512195124, 21.428571428571427, 32.432432432432435, 10.526315789473685, 36.8421052631579, 13.114754098360656, 17.91044776119403, 12.64367816091954, 16.0, 22.727272727272727, 18.181818181818183, 9.523809523809524, 17.105263157894736, 11.904761904761905, 20.58823529411765, 10.714285714285714, 15.686274509803921, 27.5, 16.129032258064516, 21.333333333333332, 40.90909090909091, 11.904761904761905, 13.157894736842104]
setB = [1.492537313432836, 3.5714285714285716, 17.94871794871795, 11.363636363636363, 13.513513513513514, 14.285714285714286, 15.686274509803921, 17.94871794871795, 9.090909090909092, 41.07142857142857, 10.714285714285714, 25.0, 20.0, 40.0, 13.333333333333334, 13.793103448275861, 3.5714285714285716, 17.073170731707318, 25.675675675675677, 15.625, 17.46031746031746, 8.333333333333334, 18.64406779661017, 14.285714285714286, 0.0, 6.0606060606060606, 6.976744186046512, 18.181818181818183, 26.785714285714285, 22.80701754385965, 6.666666666666667, 12.5]
setC = [13.846153846153847, 23.076923076923077, 25.0, 10.714285714285714, 16.666666666666668, 9.75609756097561, 10.0, 10.0, 17.857142857142858, 20.0, 9.75609756097561, 26.470588235294116, 12.5, 13.333333333333334, 4.3478260869565215, 5.882352941176471, 14.545454545454545, 13.333333333333334, 8.571428571428571, 11.764705882352942, 0.0]
import matplotlib.pyplot as plt

plt.figure('sets')
n, bins, patches = plt.hist(setA, 20, alpha=0.40, label='setA')
n, bins, patches = plt.hist(setB, 20, alpha=0.40, label='setB')
n, bins, patches = plt.hist(setC, 20, alpha=0.40, label='setC')
plt.xlabel('Set')
plt.ylabel('Frequency')
plt.title('Different Sets that need to be normalised')
plt.legend()
plt.grid(True)
plt.show()
As a plus, because my aim is to compare the distributions of the three sets, is there a better way of displaying the histograms so that they are easier to compare graphically?
You could normalise the histograms using the normed=True option (called density=True in newer versions of Matplotlib, where normed has been removed). This scales each histogram so that its area integrates to 1.
You could also make the plot look a bit tidier by using the same fixed bins for all three histograms (using the bins option to hist: bins = np.arange(0,48,2), for example).
Try this:
import numpy as np
...
mybins = np.arange(0, 48, 2)
# density=True is the current name of the older normed=True option
n, bins, patches = plt.hist(setA, bins=mybins, alpha=0.40, label='setA', density=True)
n, bins, patches = plt.hist(setB, bins=mybins, alpha=0.40, label='setB', density=True)
n, bins, patches = plt.hist(setC, bins=mybins, alpha=0.40, label='setC', density=True)
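As a quick numerical check of what this normalisation does, here is a minimal sketch using np.histogram with density=True on the same bins. The two short lists are made-up stand-ins for the real sets (an assumption for illustration only), chosen to have different sizes.

```python
import numpy as np

mybins = np.arange(0, 48, 2)

# Hypothetical stand-in data with different numbers of values
samples = {
    "small": [1.0, 5.0, 5.5, 12.0, 30.0],
    "large": [0.0, 3.0, 3.5, 7.0, 8.0, 15.0, 15.5, 22.0, 40.0, 41.0],
}

areas = {}
for name, data in samples.items():
    # density=True returns bar heights such that sum(height * width) == 1
    heights, edges = np.histogram(data, bins=mybins, density=True)
    areas[name] = float(np.sum(heights * np.diff(edges)))

print(areas)
```

Both areas come out as 1 regardless of how many values each set holds, which is what makes the shapes directly comparable.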
Another option is to plot all three histograms in one call to plt.hist, in which case you can use the stacked=True option, which can further clean up your plot.
Note: this method normalises the three histograms together, so that the total integral of the whole stack is 1. It does not make each of the three histograms add up to the same value.
# density=True is the current name of the older normed=True option
n, bins, patches = plt.hist([setA, setB, setC], bins=mybins,
                            label=['setA', 'setB', 'setC'],
                            density=True, stacked=True)
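To make the note about joint normalisation concrete, here is a small arithmetic sketch. It assumes the stack is normalised by dividing every count by the combined number of values and the bin width, which is what "total integral is 1" implies; the two lists are made-up stand-ins, not the real sets.

```python
import numpy as np

mybins = np.arange(0, 48, 2)
widths = np.diff(mybins)

# Hypothetical stand-in data: 5 values and 15 values
small = [1.0, 5.0, 5.5, 12.0, 30.0]
large = [0.0, 3.0, 3.5, 7.0, 8.0, 15.0, 15.5, 22.0,
         40.0, 41.0, 2.5, 9.0, 9.5, 20.0, 33.0]

counts_small, _ = np.histogram(small, bins=mybins)
counts_large, _ = np.histogram(large, bins=mybins)
total = counts_small.sum() + counts_large.sum()

# Joint normalisation: each layer is scaled by the combined total
area_small = float(np.sum(counts_small / (total * widths) * widths))
area_large = float(np.sum(counts_large / (total * widths) * widths))

print(area_small, area_large, area_small + area_large)
```

Under that assumption each layer's area is its share of the combined data (here 5/20 and 15/20), so larger sets still occupy more of the stack.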
Or, finally, if a stacked histogram is not to your taste, you can plot the bars next to each other, by again plotting all three histograms in one call, but removing the stacked=True option from the line above:
# density=True is the current name of the older normed=True option
n, bins, patches = plt.hist([setA, setB, setC], bins=mybins,
                            label=['setA', 'setB', 'setC'],
                            density=True)
As discussed in the comments, when using stacked=True, the normed option just means that the sum of all three histograms will equal 1, so the individual sets are not normalised in the same way as in the other methods.
To counter this, we can use np.histogram to normalise each set separately, and plot the results using plt.bar.
For example, using the same data sets from above:
mybins = np.arange(0,48,2)
# density=True replaces the older normed=True argument of np.histogram
nA, binsA = np.histogram(setA, bins=mybins, density=True)
nB, binsB = np.histogram(setB, bins=mybins, density=True)
nC, binsC = np.histogram(setC, bins=mybins, density=True)
# Since each of these integrates to 1., let's divide by 3.,
# so the stacked histogram as a whole integrates to 1.
nA /= 3.
nB /= 3.
nC /= 3.
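# Use bottom= to set where the bars should begin.
# align='edge' puts each bar's left edge on the bin edge
# (newer Matplotlib versions centre bars on x by default).
plt.bar(binsA[:-1], nA, width=2, align='edge', color='b', label='setA')
plt.bar(binsB[:-1], nB, width=2, align='edge', color='g', label='setB', bottom=nA)
plt.bar(binsC[:-1], nC, width=2, align='edge', color='r', label='setC', bottom=nA+nB)
plt.legend()
plt.show()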
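If what you want is for the bar heights themselves (rather than the areas) to sum to the same total for every set, one further option, not part of the approaches above, is to weight each value by 1/len(data). The sketch below shows the idea with np.histogram; relative_frequencies is a hypothetical helper name and the two lists are made-up stand-ins for the real sets.

```python
import numpy as np

def relative_frequencies(data, bins):
    """Return per-bin counts divided by the number of values in data."""
    data = np.asarray(data, dtype=float)
    # Each value contributes 1/N, so the bar heights sum to 1
    weights = np.ones_like(data) / data.size
    freqs, edges = np.histogram(data, bins=bins, weights=weights)
    return freqs, edges

mybins = np.arange(0, 48, 2)

# Hypothetical stand-in data with different numbers of values
small = [1.0, 5.0, 5.5, 12.0, 30.0]
large = [0.0, 3.0, 3.5, 7.0, 8.0, 15.0, 15.5, 22.0, 40.0, 41.0]

freq_small, _ = relative_frequencies(small, mybins)
freq_large, _ = relative_frequencies(large, mybins)

print(freq_small.sum(), freq_large.sum())
```

plt.hist accepts a weights argument as well, so the same weights array can be passed to it directly if you prefer to let Matplotlib do the binning.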