Tags: python, seaborn, frequency-distribution, scipy.stats

Differences in frequency count: stats.relfreq vs Seaborn


I'm using Seaborn to plot a relative frequency histogram. Since I haven't found a way to save the value associated with the highest peak, I used stats.relfreq to compute it. However, the relative frequencies do not seem to match.

I am using Python in Jupyter Notebook.

My data:

my_data = [0.9995, 0.9995, -0.0803, -0.7736, 0.9418, 0.3612, 0.5023, 0.9686, 0.5574, 0.8629, 0.5226, 0.9947, 0.9947, 0.9947, 0.9947, 0.9947, 0.9947, 0.9947, 0.9947, 0.9947, 0.9947, 0.9947, 0.9947, -0.8391, -0.4767, 0.3612, 0.4215, 0.8176, 0.5106, -0.0772, 0.0865, -0.6739, -0.5574, -0.6776, 0.4588, -0.2263, 0.8224, 0.3804, 0.3804, -0.0516, -0.3818, 0.0325, 0.6341, 0.0516, -0.5859, -0.5106, -0.0258, 0.128, 0.8126, -0.4201, -0.2449, -0.4215, -0.3506, 0.3612, -0.872, -0.872, 0.7506, -0.5719, 0.7003, -0.235, 0.1747, 0.5994, 0.5423, -0.25, 0.8834, 0.1761, -0.7691, 0.6249, 0.7819, -0.34700000000000003, -0.6486, 0.2955, 0.6486, 0.1734, -0.2732, -0.6486, -0.6049, -0.6049, -0.8622, -0.8622, -0.8622, 0.5423, 0.4404, 0.25, 0.25, 0.5106, 0.4404, 0.4404, 0.5519, 0.5519, 0.5583, -0.1027, -0.2732, -0.1027, 0.5423, 0.4939, -0.2144, 0.25, 0.2247, 0.9079, 0.128, -0.7273, -0.4329, 0.8126, 0.2263, -0.5423, 0.5106, -0.7362, 0.34, -0.6115, -0.5994, -0.6697, 0.9201, 0.1027, 0.5922, 0.5922, 0.3822, 0.5667, 0.8316, 0.9679, 0.29600000000000004, 0.3612, 0.5574, 0.3169, 0.3612, -0.9413, -0.9413, 0.5994, 0.6478, 0.4404, 0.29600000000000004]

My code:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

points = my_data  # the list shown above

# Calculate relative frequency of values, using 10 bins.
res = stats.relfreq(points, numbins = 10)
relative_frequency = res.frequency
print(relative_frequency)

# Find the index of the bin with the highest relative frequency
highest_index = int(np.argmax(relative_frequency))
print(highest_index)

# Ordered list with possible scores associated to each frequency bin
possible_scores = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
averaged_relative_frequency_score = possible_scores[highest_index]
print(averaged_relative_frequency_score)

# Plot histogram with Seaborn (assumes the plotted scores are the same values as points)
ax = sns.histplot(data = points, stat = 'probability', bins = 10, binwidth = 0.2, binrange = [-1, 1])

plt.xlim(-1,1)
plt.show()

Below is the output I get.

print(relative_frequency)
#relative_frequency [0.0610687  0.06870229 0.09923664 0.07633588 0.04580153 0.08396947
 0.16793893 0.17557252 0.07633588 0.14503817]

print(highest_index)
# highest index = 7

print(averaged_relative_frequency_score)
# averaged_relative_frequency_score = 0.5

And the Seaborn plot:

[Image: histogram]

As you can tell, the highest peak in the Seaborn plot is in the last bin, which would correspond to index 9 in the frequencies calculated with the stats module, not index 7. Are bins sized differently in stats.relfreq compared to Seaborn?

Have I misunderstood anything obvious? I can't seem to understand why I get different peaks with the two methods.

Ciao!


Solution

  • Just after writing this I figured out what was wrong.

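    By default, the bins in stats.relfreq are slightly oversized: when no limits are given, the histogram range extends a little beyond the minimum and maximum of the data, so the bin edges do not line up with the [-1, 1] range used in the Seaborn plot.

    To get the same result, specify the limits of the histogram with the defaultreallimits parameter.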

    Implemented in code:

    res = stats.relfreq(points, numbins = 10, defaultreallimits = [-1, 1])
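To make the difference visible, here is a minimal sketch that compares the bin edges with and without defaultreallimits. The `sample` array and the `edges` helper are made up for illustration (they are not the data or code from the question), and the padding formula follows the SciPy documentation for relfreq as I understand it, so treat it as an assumption to check against your SciPy version.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, not the full data set from the question.
sample = np.array([-0.9413, -0.5574, -0.0803, 0.3612, 0.5023,
                   0.8629, 0.9947, 0.9947, 0.9995])
numbins = 10


def edges(res, numbins):
    """Rebuild the bin edges from the lower limit and bin size of a relfreq result."""
    return res.lowerlimit + res.binsize * np.arange(numbins + 1)


# Default behaviour: the range is padded on both sides of the data.
default = stats.relfreq(sample, numbins=numbins)

# Explicit limits: ten bins of width 0.2 covering [-1, 1].
fixed = stats.relfreq(sample, numbins=numbins, defaultreallimits=(-1, 1))

# Padding applied by default (per the SciPy docs): half a bin of the unpadded range.
pad = (sample.max() - sample.min()) / (2 * (numbins - 1))

print("default edges:", np.round(edges(default, numbins), 4))
print("fixed edges:  ", np.round(edges(fixed, numbins), 4))

# Cross-check against a plain NumPy histogram over the same fixed range.
counts, _ = np.histogram(sample, bins=numbins, range=(-1, 1))
reference = counts / sample.size
print("matches NumPy:", np.allclose(fixed.frequency, reference))
```

With the limits fixed to (-1, 1), the bins are 0.2 wide and start at -1, which should correspond to the bins requested from sns.histplot via binwidth = 0.2 and binrange = [-1, 1]. The index of the highest peak can then be read with np.argmax(fixed.frequency).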