Search code examples
pythonscikit-learnkernel-density

how does 2d kernel density estimation in python (sklearn) work?


I am sorry for the probably stupid question but I am trying now for hours to estimate a density from a set of 2d data. Let's assume my data is given by the array: sample = np.random.uniform(0,1,size=(50,2)) . I just want to use scipys scikit learn package to estimate the density from the sample array (which is here of course a 2d uniform density) and I am trying the following:

import numpy as np
from sklearn.neighbors.kde import KernelDensity
from matplotlib import pyplot as plt
sp = 0.01

samples = np.random.uniform(0,1,size=(50,2))  # random samples
x = y = np.linspace(0,1,100)
X,Y = np.meshgrid(x,y)     # creating grid of data , to evaluate estimated density on

kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(samples) # creating density from samples

kde.score_samples(X,Y) # I want to evaluate the estimated density on the X,Y grid

But the last step always yields the error: score_samples() takes 2 positional arguments but 3 were given

So probably .score_samples cannot take a grid as input, but there no tutorials/docs for the 2d case so I don't know how to fix this issue. It would be really great if someone could help.


Solution

  • Looking at the Kernel Density Estimate of Species Distributions example, you have to package the x,y data together (both the training data and the new sample grid).

    Below is a function that simplifies the sklearn API.

    from sklearn.neighbors import KernelDensity
    
    def kde2D(x, y, bandwidth, xbins=100j, ybins=100j, **kwargs): 
        """Build 2D kernel density estimate (KDE)."""
    
        # create grid of sample locations (default: 100x100)
        xx, yy = np.mgrid[x.min():x.max():xbins, 
                          y.min():y.max():ybins]
    
        xy_sample = np.vstack([yy.ravel(), xx.ravel()]).T
        xy_train  = np.vstack([y, x]).T
    
        kde_skl = KernelDensity(bandwidth=bandwidth, **kwargs)
        kde_skl.fit(xy_train)
    
        # score_samples() returns the log-likelihood of the samples
        z = np.exp(kde_skl.score_samples(xy_sample))
        return xx, yy, np.reshape(z, xx.shape)
    

    This gives you the xx, yy, zz needed for something like a scatter or pcolormesh plot. I've copied the example from the scipy page on the gaussian_kde function.

    import numpy as np
    import matplotlib.pyplot as plt
    
    m1 = np.random.normal(size=1000)
    m2 = np.random.normal(scale=0.5, size=1000)
    
    x, y = m1 + m2, m1 - m2
    
    xx, yy, zz = kde2D(x, y, 1.0)
    
    plt.pcolormesh(xx, yy, zz)
    plt.scatter(x, y, s=2, facecolor='white')
    

    example figure