Tags: python, scikit-learn, gaussian, kernel-density

sklearn: KDE not working for small values


I am struggling to get scikit-learn's KDE implementation to work for small input ranges. The following code works; increasing the divisor variable to 100 makes the KDE break down:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.neighbors import KernelDensity

# make data:
np.random.seed(0)
divisor = 1
gaussian1 = (3 * np.random.randn(1700))/divisor
gaussian2 = (9 + 1.5 * np.random.randn(300)) / divisor
gaussian_mixture = np.hstack([gaussian1, gaussian2])

# illustrate proper KDE with seaborn:
sns.distplot(gaussian_mixture);

(figure: seaborn distplot of the mixture, showing a correct bimodal KDE)

# now implement in sklearn:

x_grid = np.linspace(min(gaussian1), max(gaussian2), 200)

kde_skl = KernelDensity(bandwidth=0.5)
kde_skl.fit(gaussian_mixture[:, np.newaxis])
# score_samples() returns the log-likelihood of the samples
log_pdf = kde_skl.score_samples(x_grid[:, np.newaxis])
pdf = np.exp(log_pdf)

fig, ax = plt.subplots(1, 1, sharey=True, figsize=(7, 4))
ax.plot(x_grid, pdf, linewidth=3, alpha=0.5)

(figure: sklearn KDE over x_grid, matching the seaborn result)

Works fine. However, after changing the divisor variable to 100, scipy and seaborn can still handle the smaller data values, but sklearn's KDE cannot with my implementation:

(figure: sklearn KDE with divisor = 100, producing a degenerate, flattened density)

What am I doing wrong, and how can I rectify this? I need sklearn's implementation of KDE, so I cannot use another library.


Solution

  • Kernel Density Estimation is called a nonparametric method, but it actually has a parameter called bandwidth.

    Every application of KDE needs this parameter set!

    When you do the seaborn-plot:

    sns.distplot(gaussian_mixture);
    

    you are not giving any bandwidth, so seaborn falls back on a default heuristic (Scott or Silverman). These heuristics use the data itself to choose the bandwidth, so the bandwidth shrinks along with your values.

    Your sklearn code looks like:

    kde_skl = KernelDensity(bandwidth=0.5)
    

    There the bandwidth is fixed/constant! After dividing by 100, the spread of your data is on the order of 0.03, so a fixed bandwidth of 0.5 is far wider than the data itself and smears the whole density into one flat blob. That is almost certainly your problem. In general, one would combine sklearn's KDE with GridSearchCV as a cross-validation tool to select a good bandwidth. This is often slower, but usually better than the heuristics above.

    Sadly, you did not explain why you must use sklearn's KDE. My personal ranking of the three popular candidates is statsmodels > sklearn > scipy.
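To make the bandwidth point concrete, here is a sketch (my own code, not from the answer; the log-spaced search grid is an arbitrary choice) that rebuilds the question's data with divisor = 100, computes Silverman's rule-of-thumb bandwidth, and selects a bandwidth via GridSearchCV:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# rebuild the questioner's data with divisor = 100
np.random.seed(0)
divisor = 100
gaussian1 = (3 * np.random.randn(1700)) / divisor
gaussian2 = (9 + 1.5 * np.random.randn(300)) / divisor
X = np.hstack([gaussian1, gaussian2])[:, np.newaxis]

# Silverman's rule of thumb: h = 0.9 * min(std, IQR/1.34) * n^(-1/5)
n = len(X)
iqr = np.subtract(*np.percentile(X, [75, 25]))
h_silverman = 0.9 * min(X.std(), iqr / 1.34) * n ** (-1 / 5)
print(h_silverman)  # on the order of 0.01 -- far below the fixed 0.5

# cross-validated bandwidth: maximize held-out log-likelihood
# over a log-spaced grid of candidate bandwidths
grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.logspace(-3, 1, 30)},
                    cv=5)
grid.fit(X)
print(grid.best_params_['bandwidth'])  # also far below 0.5

# evaluate the tuned density on a grid, as in the question
x_grid = np.linspace(X.min(), X.max(), 200)[:, np.newaxis]
pdf = np.exp(grid.best_estimator_.score_samples(x_grid))
```

With a bandwidth on this scale, sklearn's KDE recovers the bimodal shape just as seaborn does; the failure in the question is the fixed bandwidth, not the library.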