Tags: python, matplotlib, seaborn

Difference between histplot and pyplot?


I have a CSV file named Price with a single column, 'phil', and 7412 rows (shown in an image). I use histplot and plot to draw a normal distribution with this code:

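import statistics

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import norm

df4 = pd.read_csv(r'C:\Users\ThuyNT13\Desktop\price.csv')

# Histogram of the data plus its kernel density estimate (red curve)
sns.histplot(df4['phil'], color='red', kde=True, stat='density')

# Normal pdf with the same mean and standard deviation as the data (blue curve)
mean4 = statistics.mean(df4['phil'])
sd4 = statistics.stdev(df4['phil'])

pdf4 = norm.pdf(df4['phil'].sort_values(), mean4, sd4)
plt.plot(df4['phil'].sort_values(), pdf4, label='Philippines', color='blue')
plt.ticklabel_format(style='plain')
plt.show()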

The result shows two curves with different shapes (image). Why do they differ, and what does each curve mean?


Solution

  • The kde (in red) is just a smoothed estimate of the distribution's density. Since you have quite a lot of data, it more or less follows the histogram (at a coarser resolution, obviously).

    The pdf you compute is that of the normal distribution whose mean and standard deviation are those of your data.

    Both curves would be (roughly) the same if your data actually followed a normal distribution.

    import seaborn
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    import scipy.stats
    
    df=pd.DataFrame({'phil':np.random.normal(3500, 500, 10000)})
    seaborn.histplot(df.phil, color='red', kde=True, stat='density')
    μ,σ=df.phil.mean(), df.phil.std()
    phsort=df.phil.sort_values()
    pdf=scipy.stats.norm.pdf(phsort, μ, σ)
    plt.plot(phsort, pdf)
    plt.show()
    


    The histogram shows the actual distribution of the data. The red curve is a smoothed version of the same thing. The blue curve is the probability density of the normal distribution with the same mean and standard deviation. Since the data were in fact drawn from a normal distribution, the blue curve unsurprisingly fits the data well (the actual μ and σ for this curve are 3495 and 505, which is well within what is expected when you draw 10000 numbers from a normal distribution with parameters (3500, 500)).
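
    To put a number on "well within what is expected", here is a small sketch using the standard-error formulas for a sample mean and a sample standard deviation. It assumes the draws are independent and normal, and it uses the large-sample approximation σ/√(2(n−1)) for the standard deviation; the observed values are the rounded ones quoted above.

```python
import math

# Parameters of the example: 10000 draws from a normal with mean 3500, sd 500.
n, mu, sigma = 10_000, 3500.0, 500.0

# Standard error of the sample mean: sigma / sqrt(n).
se_mean = sigma / math.sqrt(n)

# Approximate standard error of the sample standard deviation
# (large-sample formula, valid for normal data): sigma / sqrt(2 * (n - 1)).
se_std = sigma / math.sqrt(2 * (n - 1))

# Observed (rounded) values quoted in the answer.
obs_mean, obs_std = 3495.0, 505.0

# Distance of each observed value from the true parameter, in standard errors.
z_mean = (obs_mean - mu) / se_mean
z_std = (obs_std - sigma) / se_std

print(f"SE of mean: {se_mean:.2f}, observed mean is {z_mean:+.2f} SE away")
print(f"SE of std:  {se_std:.2f}, observed std is {z_std:+.2f} SE away")
```

    Both deviations come out under 2 standard errors, which is ordinary sampling noise.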

    Now let's do the same thing with a distribution that is not normal at all:

    df=pd.DataFrame({'phil':np.random.uniform(500, 7000, 10000)})
    seaborn.histplot(df.phil, color='red', kde=True, stat='density')
    μ,σ=df.phil.mean(), df.phil.std()
    phsort=df.phil.sort_values()
    pdf=scipy.stats.norm.pdf(phsort, μ, σ)
    plt.plot(phsort, pdf)
    plt.show()
    


    Same as before: the histogram is the distribution of the actual data (drawn uniformly between 500 and 7000). The red curve is just a smoothed version of that. The blue curve is the normal distribution with μ=3775 (the mean of my uniform data) and σ=1876 (the standard deviation of my data).

    Which, of course, doesn't fit the data at all. Same mean and standard deviation, sure, but one distribution is normal and the other is not.
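
    The mismatch can also be seen without plotting. The sketch below is an illustration only: it uses the theoretical mean and standard deviation of a uniform distribution on [500, 7000] rather than the sample values, and the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`. It compares the flat uniform density with the fitted normal density at the centre and at the edge of the interval.

```python
import math
from statistics import NormalDist

a, b = 500.0, 7000.0

# Theoretical mean and standard deviation of a uniform distribution on [a, b].
mu = (a + b) / 2                 # 3750
sigma = (b - a) / math.sqrt(12)  # about 1876

# Normal distribution with the same mean and standard deviation.
fitted = NormalDist(mu, sigma)

uniform_height = 1 / (b - a)     # the uniform density is flat on [a, b]
normal_peak = fitted.pdf(mu)     # normal density at its centre
normal_at_edge = fitted.pdf(a)   # normal density at the edge of the data range

# Probability the fitted normal puts outside [a, b], where no data can exist.
outside = fitted.cdf(a) + (1 - fitted.cdf(b))

print(f"uniform density everywhere on [a, b]: {uniform_height:.6f}")
print(f"normal density at the mean:           {normal_peak:.6f}")
print(f"normal density at the edge:           {normal_at_edge:.6f}")
print(f"normal probability outside [a, b]:    {outside:.3f}")
```

    The normal curve is too high in the middle, too low at the edges, and puts some probability outside the range of the data, even though the mean and standard deviation match.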

    Same goes for your data: they obviously do not follow a normal distribution. So your histogram and your red curve follow the distribution of your data, while the blue curve shows what the distribution would be if the data were normal with the same mean and standard deviation.

    You can see how long the right tail of your data is compared to the left tail. Very big values, even if not numerous, pull the mean to the right. (A bit like comparing median income with mean income: the mean is inflated by a few super-rich people, who are not numerous but are very rich.) So, unsurprisingly, the normal curve (which is symmetric, while your data are not) is centered further to the right than the peak of your data.

    It also has a bigger standard deviation than the concentration of your data would suggest, because of all those big values. Most of the data have to lie within about 2 standard deviations of the mean, and since the mean has been shifted away from the main group, the standard deviation has to be big. Hence a normal curve that is more spread out (strictly speaking both curves extend infinitely, but the normal one has a wider "95% interval").

    Plus, since your data are prices, they obviously can't be negative, whereas the normal distribution is symmetric. Your data peak at 10000 or so but have values at 200000, that is, 190000 beyond the peak, so from the normal point of view you should have just as many values at -180000. Or, more realistically, since values around 50000 are not the majority but not exceptional either, you should have the same number around -30000. So the normal curve is wider, and therefore has a lower peak, since the total must be the same (the area under each curve is 1).
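
    The effect of a long right tail can be reproduced on synthetic data. The sketch below does not use the original CSV: the log-normal draws and their parameters (9.2, 1.0) are a hypothetical stand-in for a skewed, strictly positive price column, chosen only for illustration. It shows that the mean sits to the right of the median, and that a normal distribution with the same mean and standard deviation puts a sizeable probability on negative prices.

```python
import random
from statistics import NormalDist, mean, median, stdev

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical stand-in for the 'phil' column: skewed, strictly positive values.
prices = [random.lognormvariate(9.2, 1.0) for _ in range(7412)]

m, med, s = mean(prices), median(prices), stdev(prices)

# Normal distribution with the same mean and standard deviation as the data.
fitted = NormalDist(m, s)

# Probability the fitted normal assigns to negative prices,
# versus the share of negative values actually present in the data.
p_negative = fitted.cdf(0.0)
share_negative = sum(p < 0 for p in prices) / len(prices)

print(f"mean = {m:.0f}, median = {med:.0f}, std = {s:.0f}")
print(f"fitted normal P(price < 0) = {p_negative:.3f}")
print(f"share of negative prices in the data = {share_negative:.3f}")
```

    With data like these, the normal curve has to spend part of its area on values the data never take, which is one reason its peak ends up lower than the peak of the histogram.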

    So, long story short: your data do not follow a normal distribution, so, unsurprisingly, the normal density curve doesn't look like the density curve of your data.