
Why are the counts in the matplotlib plot and the seaborn plot different, and both wrong?


The dataset I'm using is tips from seaborn. I wanted to plot a histogram of the total_bill column, and I did that using both seaborn and matplotlib.

This is my matplotlib histogram:

plt.hist(tips_df.total_bill);


And this is my seaborn histogram:

sns.histplot(tips_df.total_bill)


As you can see, the frequency peaks around a total_bill of 13. However, the peak is around 68 in matplotlib, while it's around 48 in seaborn.

Both seem wrong, because on typing

tips_df["total_bill"].value_counts().sort_values(ascending=False).head(5)

we get the output

13.42    3
15.69    2
10.34    2
10.07    2
20.69    2

Name: total_bill, dtype: int64

As we can see, the most frequent bill is around 13, but why are the count values on the y-axis wrong?


Solution

  • In a histogram, the height of each rectangle is the number of values that fall in a given range, and that range is the width of the rectangle. With equal-width rectangles, you can get the width of each one as (max - min) / number_of_rectangles.

    For example, matplotlib's output has 10 rectangles (bins). Since your data has a minimum around 3 and a maximum around 50, each bin is around 4.7 units wide. To get the 3rd bin's range, start from the minimum and skip two bin widths: 3 + 4.7*2 = 12.4. The bin then ends at 12.4 + 4.7 = 17.1. So the count for the 3rd bin is the number of values in tips_df.total_bill that fall in this range. Let's find it manually:

    >>> tips_df.total_bill.between(12.4, 17.1).sum()
    70
    

    (Since I used crude approximations for the bin edges, this is not exact, but I hope you get the idea.)
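The same arithmetic can be sketched in plain Python. This is only an illustration under stated assumptions: the `bills` list is made up (it is not the real tips data) and was chosen so that its minimum is 3.0 and its maximum is 50.0, matching the rough numbers used above.

```python
from collections import Counter

# Hypothetical bill amounts (NOT the real tips data), chosen so that
# min = 3.0 and max = 50.0 as in the rough numbers above.
bills = [3.0, 7.25, 8.5, 10.34, 12.6, 13.42, 13.42, 13.42,
         15.69, 16.0, 17.5, 20.69, 24.1, 29.8, 35.0, 50.0]

n_bins = 10
width = (max(bills) - min(bills)) / n_bins   # (50.0 - 3.0) / 10 = 4.7

# Edges of the 3rd bin: skip two full bin widths from the minimum.
left = min(bills) + 2 * width                # about 12.4
right = left + width                         # about 17.1

# Height of the 3rd bar: how many values fall inside that range.
third_bin_count = sum(left <= b < right for b in bills)

# What value_counts reports instead: occurrences of one exact value.
most_common_value, exact_count = Counter(bills).most_common(1)[0]

print(f"3rd bin [{left:.1f}, {right:.1f}) holds {third_bin_count} values")
print(f"most frequent exact value {most_common_value} occurs {exact_count} times")
```

On this toy list, the bar over the 3rd bin is taller than the count of any single value, because the bar adds up every value that lands in the range.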

    So far this explains why value_counts doesn't match the histogram: value_counts counts occurrences of each exact value, whereas a histogram counts the values that fall in each range.

    Now, why do the seaborn and matplotlib graphs differ? Because they use a different number of bins. If you count, matplotlib has 10 and seaborn has 14. Since you didn't pass a bins argument to either of them, each uses its default: matplotlib falls back to plt.rcParams["hist.bins"], and seaborn chooses the number "automatically" (see the Notes section of the sns.histplot documentation).
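To see how the number of bins alone changes the bar heights, here is a small sketch without any plotting. The helper `bin_counts` and the generated `sample` are hypothetical, written for this illustration only; the helper mimics equal-width binning with the maximum placed in the last bin, and is not the code either library runs internally.

```python
import random

def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum falls in the last bin
        # (its right edge is treated as inclusive).
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Made-up, right-skewed sample of 244 amounts (NOT the real tips data).
rng = random.Random(0)
sample = [round(rng.lognormvariate(2.9, 0.45), 2) for _ in range(244)]

counts_10 = bin_counts(sample, 10)   # matplotlib's default number of bins
counts_14 = bin_counts(sample, 14)   # the number seaborn picked in the plot above

print("10 bins:", counts_10, "tallest:", max(counts_10))
print("14 bins:", counts_14, "tallest:", max(counts_14))
print("totals:", sum(counts_10), sum(counts_14))
```

Both lists add up to the sample size; only the way the values are split across bins differs. Narrower bins usually hold fewer values each, which is consistent with the lower peak in the seaborn plot.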

    So, we can pass the same bins argument to both functions to get matching output:

    >>> plt.hist(tips_df.total_bill, bins=10)
    


    >>> sns.histplot(tips_df.total_bill, bins=10)
    
