python, scikit-learn, kernel-density

How to normalize the KDE of scikit-learn?


Let's say I have an array of shape (1000, 1), representing 1000 samples of a variable X uniformly distributed between 0 and 1. I want to approximate the probability density of this variable, and I use Scikit-Learn's KernelDensity to do that.

The problem is that the result I get is not normalized: the integral of the estimated probability density does not equal 1. How can I normalize it automatically? Am I doing something wrong?

import numpy as np
from sklearn.neighbors import KernelDensity

def kde_sklearn(data, grid, **kwargs):
    """
    Kernel Density Estimation with Scikit-learn

    Parameters
    ----------
    data : numpy.array
        Data points used to compute a density estimator. It
        has `n x p` dimensions, representing n points and p
        variables.
    grid : numpy.array
        Data points at which the density will be estimated. It
        has `m x p` dimensions, representing m points and p
        variables.

    Returns
    -------
    out : numpy.array
        Density estimate. Has `m x 1` dimensions
    """
    kde_skl = KernelDensity(**kwargs)
    kde_skl.fit(data)
    # score_samples() returns the log-likelihood of the samples
    log_pdf = kde_skl.score_samples(grid)
    return np.exp(log_pdf) 

X = np.random.uniform(0,1,1000).reshape(-1,1)
X1 = np.linspace(0,1,100)[:,np.newaxis]

kde_sklearn(X,X1,kernel='tophat')

Out[43]: 
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

I expected a vector of ones, since the density of a uniform distribution on [0, 1] is 1 everywhere and its integral over [0, 1] is 1.


Solution

  • The problem isn't with normalization, as I can show with an example. Suppose that I run the following code, which fits a KDE to samples from a standard normal distribution:

    import numpy as np
    import sklearn.neighbors as sn
    
    # Sample from a standard normal distribution
    XX = np.random.randn(1000).reshape(-1, 1)
    
    # Fit a KDE
    kde_sklg = sn.KernelDensity()
    kde_sklg.fit(XX)
    
    # Get estimated densities
    XX1 = np.linspace(-4.0, 4.0, 100)[:, np.newaxis]
    gdens = np.exp(kde_sklg.score_samples(XX1))
    

    I can then estimate the area under the PDF with the trapezoid rule as follows:

    # Trapezoidal rule: half the sum of adjacent densities times the grid spacing
    my_area = 0.0
    for i in range(1, gdens.shape[0]):
        my_area += 0.5*(gdens[i] + gdens[i-1])*(XX1[i,0] - XX1[i-1,0])
    

    The estimated area (my_area) that I get is about 0.996, pretty close to 1.
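
    For what it's worth, NumPy's np.trapz computes the same trapezoidal estimate in a single call, reusing gdens and XX1 from above:

    my_area = np.trapz(gdens, XX1[:, 0])  # also about 0.996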

    The problem is that your KDE isn't handling the jumps in your uniform PDF at 0 and 1, so it smears them out too much. About half of the area under the KDE's estimate of your PDF then ends up beneath those smeared regions. If you replace your X1 with, say, X2 = np.linspace(-1,2,200)[:,np.newaxis], you can see that the KDE's estimate puts significant density over the intervals [-1,0] and [1,2], as the sketch below shows.
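
    Here is a minimal sketch of that check, reusing X and the kde_sklearn helper from the question (the bandwidth of 0.05 at the end is just an illustrative choice):

    # Evaluate the tophat KDE on a grid that also covers the smeared regions
    X2 = np.linspace(-1, 2, 200)[:, np.newaxis]
    dens_wide = kde_sklearn(X, X2, kernel='tophat')
    print(np.trapz(dens_wide, X2[:, 0]))   # close to 1: the KDE is normalized

    # With the default bandwidth of 1.0, every tophat kernel spans all of
    # [0, 1], so the estimate there is a flat 1000/(2*1.0*1000) = 0.5.
    # A narrower bandwidth keeps most of the mass inside [0, 1]:
    dens_narrow = kde_sklearn(X, X2, kernel='tophat', bandwidth=0.05)
    interior = (X2[:, 0] > 0.2) & (X2[:, 0] < 0.8)
    print(dens_narrow[interior].mean())    # roughly 1, as expected for U(0, 1)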