Why are pandas and seaborn producing different KDE plots for the same data?


I am trying to look at the distribution of a variable with the following values:

+-------+-------+
| Value | Count |
+-------+-------+
| 0.0   |   355 |
| 1.0   |   935 |
| 2.0   |     1 |
| 3.0   |     2 |
| 4.0   |     1 |
+-------+-------+

The table continues with values up to 1000, but very sparsely (1622 observations in total, almost all of them at 0 or 1).

So when plotting I did:

sns.distplot(a=df.loc[df["class"] == 1].variable_of_interest, kde=True)

This produces the following red distribution:

KDE Plot with seaborn

Seaborn does not capture the initial concentration of values, but shows more sensitivity to the rest of the values.

Then I remembered pd.DataFrame.plot.kde(), so I gave it a try, and it produces a plot that does capture the concentration:

df.loc[df["class"] == 1].variable_of_interest.plot.kde()

Pandas KDE Distribution

Important note: for those who might notice a difference in the x-axis, I did try the seaborn plot with xlims(-500, 1000), yet the plot remains exactly the same.

Do you know why they generate such different plots? Does it have to do with how they process the data, or am I doing something wrong?

Thank you very much in advance!


Solution

  • What's going wrong is that a KDE is primarily meant for continuous data, while you seem to be working with discrete data. An important parameter is the bandwidth: the smaller it is, the more closely the curve follows the individual data points; the larger it is, the smoother the curve and the better it indicates the general shape.

    It seems seaborn and pandas use different approaches to estimate a "good" bandwidth. With seaborn you could set a fixed bandwidth with sns.kdeplot(..., bw=0.5) or so, or with seaborn.distplot(..., kde=True, kde_kws={'bw': 0.5}). Newer seaborn releases deprecate bw in favor of bw_method and bw_adjust, and deprecate distplot itself. With pandas, use df.plot.kde(bw_method=0.5, ...). Note that a "perfect" bandwidth doesn't exist: it depends on the data, the number of samples, and what you already know about the underlying distribution. The defaults that seaborn and pandas choose are just rules of thumb, which may or may not be useful for your data. Future versions will probably use different rules of thumb.

    The following image shows how different bandwidths influence a kdeplot:

    kdeplot different bandwidths
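
To make the bandwidth effect concrete without relying on either library's defaults, here is a minimal hand-rolled Gaussian KDE. It is only a sketch: it is not how seaborn or pandas compute their curves, it uses just the five table rows shown in the question (the sparse tail up to 1000 is left out), and the helper name gaussian_kde_at is made up for this illustration. The bandwidth here is the kernel's standard deviation in data units, whereas a scalar bw_method in pandas is passed on to SciPy as a factor that multiplies the data's standard deviation, so the numbers are not directly interchangeable.

```python
import math

# Only the five rows shown in the question's table; the sparse tail is omitted.
counts = {0.0: 355, 1.0: 935, 2.0: 1, 3.0: 2, 4.0: 1}
total = sum(counts.values())


def gaussian_kde_at(x, bandwidth):
    """Gaussian KDE at x: count-weighted average of normal pdfs centred on each value.

    `bandwidth` is the standard deviation of each kernel, in data units.
    """
    norm = bandwidth * math.sqrt(2 * math.pi)
    weighted_sum = sum(
        n * math.exp(-0.5 * ((x - value) / bandwidth) ** 2) / norm
        for value, n in counts.items()
    )
    return weighted_sum / total


for bw in (0.1, 0.5, 5.0):
    at_peak = gaussian_kde_at(1.0, bw)   # density on the most common value
    between = gaussian_kde_at(0.5, bw)   # density halfway between 0 and 1
    print(f"bandwidth={bw}: density(1.0)={at_peak:.4f}, density(0.5)={between:.4f}")
```

With a small bandwidth the estimate has two sharp spikes at 0 and 1 and almost nothing in between; with a large one the spikes merge into a single low, wide bump. That is the same kind of difference as between the two plots in the question.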