Let's say I have an array of shape (100000, 1), representing samples of a variable X uniformly distributed between 0 and 1. I want to approximate the probability density of this variable, and I use Scikit-Learn's KernelDensity to do that.
The problem is that the result I get is not normalized: the integral of the probability density does not equal 1. What should I do to normalize it automatically? Am I doing something wrong?
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_sklearn(data, grid, **kwargs):
    """
    Kernel Density Estimation with Scikit-learn

    Parameters
    ----------
    data : numpy.array
        Data points used to compute a density estimator. It
        has `n x p` dimensions, representing n points and p
        variables.
    grid : numpy.array
        Data points at which the density will be estimated. It
        has `m x p` dimensions, representing m points and p
        variables.

    Returns
    -------
    out : numpy.array
        Density estimate. Has `m x 1` dimensions.
    """
    kde_skl = KernelDensity(**kwargs)
    kde_skl.fit(data)
    # score_samples() returns the log-likelihood of the samples
    log_pdf = kde_skl.score_samples(grid)
    return np.exp(log_pdf)
X = np.random.uniform(0,1,1000).reshape(-1,1)
X1 = np.linspace(0,1,100)[:,np.newaxis]
kde_sklearn(X,X1,kernel='tophat')
Out[43]:
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
I expected to get a vector of 1s, since the density of a uniform distribution on [0, 1] is 1 and its integral should be 1.
The problem isn't with normalization, as I can show with an example. Suppose I run the following code, which fits a KDE to samples from a standard normal distribution:
import numpy as np
import sklearn.neighbors as sn
# Sample from a standard normal distribution
XX = np.random.randn(1000).reshape(-1, 1)
# Fit a KDE (KernelDensity uses a Gaussian kernel by default)
kde_sklg = sn.KernelDensity()
kde_sklg.fit(XX)
# Get estimated densities
XX1 = np.linspace(-4.0, 4.0, 100)[:, np.newaxis]
gdens = np.exp(kde_sklg.score_samples(XX1))
I can then estimate the area under the PDF with the trapezoid rule as follows:
my_area = 0.0
for i in range(1, gdens.shape[0]):
    my_area += 0.5*(gdens[i] + gdens[i-1])*(XX1[i,0] - XX1[i-1,0])
The estimated area (my_area) that I get is about 0.996, pretty close to 1.
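As a side note, the same trapezoid sum can be computed in one line with NumPy's built-in routine; this is just an equivalent shortcut (the function is np.trapz, renamed np.trapezoid in newer NumPy releases):

# One-line trapezoid integration of the estimated densities
my_area = np.trapz(gdens, XX1[:, 0])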
The problem is that your KDE isn't handling the jumps in your uniform PDF that occur at 0 and 1, so it smears them out too much. About half the area under the KDE's estimate of your PDF then ends up beneath those smeared regions. If you replace the value of your X1 with, say, X2 = np.linspace(-1,2,200)[:,np.newaxis], you can see that there is significant density in the parts of the KDE's estimate of the PDF over the intervals [-1,0] and [1,2].
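To make that concrete, here is a quick check reusing X and kde_sklearn from your question: evaluating the same tophat KDE on the wider grid and integrating shows that the total area is still close to 1, so no mass is lost; it has simply leaked outside [0, 1].

# Evaluate the tophat KDE on a grid that extends past the data's support
X2 = np.linspace(-1, 2, 200)[:, np.newaxis]
dens2 = kde_sklearn(X, X2, kernel='tophat')
# Integrate over [-1, 2] (np.trapz is np.trapezoid in newer NumPy)
print(np.trapz(dens2, X2[:, 0]))  # close to 1.0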