Tags: python, r, ggplot2, scipy, kernel-density

Why do stat_density (R; ggplot2) and gaussian_kde (Python; scipy) differ?


I am attempting to produce a KDE-based PDF estimate for a series of distributions that may not be normally distributed.

I like the way ggplot2's stat_density in R seems to recognize every incremental bump in frequency, but I cannot replicate this with Python's scipy.stats.gaussian_kde, which seems to oversmooth.

I have set up my R code as follows:

ggplot(test, aes(x = Val, color = as.factor(Class), group = as.factor(Class))) +
  stat_density(geom = 'line', kernel = 'gaussian',
               bw = 'nrd0',   # nrd0 = 'Silverman'
               size = 1, position = 'identity')

[R example plot]

And my Python code is:

from scipy import stats

kde = stats.gaussian_kde(data.ravel())
kde.set_bandwidth(bw_method='silverman')

[Python example plot]

The R stats docs show that nrd0 is the Silverman method of bandwidth selection.

Based on the code above, I am using the same kernel (Gaussian) and the same bandwidth method (Silverman).

Can anyone explain why the results are so different?


Solution

  • There seems to be disagreement about what Silverman's Rule is. TL;DR - scipy uses a worse version of the rule that only works well on unimodal data that is normally distributed. R uses a better version that is "the best of both worlds" and works "for a wide range of densities".

    The scipy docs say that Silverman's rule is implemented as:

    def silverman_factor(self):
        return power(self.neff*(self.d+2.0)/4.0, -1./(self.d+4))
    

    where d is the number of dimensions (1, in your case) and neff is the effective sample size (the number of points, assuming no weights). So the scipy bandwidth factor is (n * 3 / 4) ^ (-1 / 5), which is then multiplied by the standard deviation of the data (computed in a separate method).
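
    As a quick numerical check (a minimal sketch; the data values are arbitrary here, since for one-dimensional, unweighted data the factor depends only on n), scipy's reported factor matches that formula:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.normal(size=200)

    kde = stats.gaussian_kde(data, bw_method='silverman')
    n = data.size
    # for d = 1 with no weights, silverman_factor reduces to (3n/4) ** (-1/5)
    print(kde.factor)                # ~0.367
    print((n * 3 / 4) ** (-1 / 5))   # ~0.367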

    By contrast, R's stats package docs describe Silverman's method as "0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power". This can be verified in R: typing bw.nrd0 at the console prints the function:

    function (x) 
    {
        if (length(x) < 2L) 
            stop("need at least 2 data points")
        hi <- sd(x)
        if (!(lo <- min(hi, IQR(x)/1.34))) 
            (lo <- hi) || (lo <- abs(x[1L])) || (lo <- 1)
        0.9 * lo * length(x)^(-0.2)
    }
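
    For comparison, here is a direct Python translation of bw.nrd0 (a sketch: the helper name bw_nrd0 is my own, it assumes one-dimensional, unweighted data, and results can differ slightly if R's default quantile type is changed):

    import numpy as np

    def bw_nrd0(x):
        # Silverman's rule as R's bw.nrd0 implements it
        x = np.asarray(x, dtype=float)
        if x.size < 2:
            raise ValueError("need at least 2 data points")
        hi = x.std(ddof=1)                               # sample standard deviation, like R's sd()
        iqr = np.subtract(*np.percentile(x, [75, 25]))   # interquartile range, like R's IQR()
        lo = min(hi, iqr / 1.34)
        if lo == 0:                                      # fall back the same way the R source does
            lo = hi or abs(x[0]) or 1.0
        return 0.9 * lo * x.size ** (-0.2)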
    

    Wikipedia, on the other hand, gives "Silverman's Rule of Thumb" as one of many possible names for the estimator:

    1.06 * sigma * n ^ (-1 / 5)
    

    The Wikipedia version is equivalent to the scipy version.
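
    The equivalence is easy to verify numerically (a quick check in plain Python):

    print((3 / 4) ** (-1 / 5))   # 1.0592..., i.e. the 1.06 constant in the Wikipedia formula
    print((4 / 3) ** (1 / 5))    # the same number, just written with a positive exponent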

    All three sources (scipy docs, Wikipedia, and R docs) cite the same original reference: Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC. p. 48. ISBN 978-0-412-24620-3. Wikipedia and R specifically cite page 48, whereas scipy's docs do not mention a page number. (I have submitted an edit to Wikipedia to update its page reference to p.45, see below.)

    Reading the Silverman book, equation 3.28 on page 45 is what the Wikipedia article uses: (4 / 3) ^ (1 / 5) * sigma * n ^ (-1 / 5) ~= 1.06 * sigma * n ^ (-1 / 5). scipy uses the same method, rewriting (4 / 3) ^ (1 / 5) as the equivalent (3 / 4) ^ (-1 / 5). Silverman describes this method:

    While (3.28) will work well if the population really is normally distributed, it may oversmooth somewhat if the population is multimodal... as the mixture becomes more strongly bimodal the formula (3.28) will oversmooth more and more, relative to the optimal choice of smoothing parameter.

    The scipy docs reference this weakness, stating:

    It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.

    However, Silverman goes on to improve on the method scipy uses, arriving at the method used by R and Stata. On page 48, we get equation 3.31:

    h = 0.9 * A * n ^ (-1 / 5)
    # A defined on previous page, eqn 3.30
    A = min(standard deviation, interquartile range / 1.34)
    

    Silverman describes this method as:

    The best of both possible worlds... In summary, the choice ([eqn] 3.31) for the smoothing parameter will do very well for a wide range of densities and is trivial to evaluate. For many purposes it will certainly be an adequate choice of window width, and for others it will be a good starting point for subsequent fine-tuning.

    So it seems Wikipedia and scipy use a simpler version of the estimator proposed by Silverman, one with known weaknesses, while R and Stata use the improved version.
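
    If the goal is to make the Python estimate match R's, one option (a sketch, assuming one-dimensional, unweighted data) is to compute the nrd0 bandwidth yourself, e.g. with the bw_nrd0 helper sketched above, and pass it to gaussian_kde as a scalar factor. scipy multiplies a scalar bw_method by the sample standard deviation internally, so divide it out first:

    import numpy as np
    from scipy import stats

    values = data.ravel()                 # your original data
    h = bw_nrd0(values)                   # R-style nrd0 bandwidth, in data units
    # a scalar bw_method is used directly as kde.factor, and the kernel width is
    # factor * std(data), so dividing by the sample standard deviation recovers h
    kde = stats.gaussian_kde(values, bw_method=h / values.std(ddof=1))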