python scikit-learn statsmodels kernel-density

How to return mean value (or expectation value) of a distribution estimated via function KernelDensity of sklearn, in python?

My question is, how to return the mean value and variance of the estimated "kde"? Or is there any other package you known that can easily output mean value or variance value, like print kde.mean() or print kde.get_parameter(mean)?

import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

N = 100
np.random.seed(1)
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]

X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

Solution

In general, you need to do this numerically. I suggest 2 different approaches:

Integration
Monte Carlo Simulation

These approaches work for any kernel and any bandwidth.

Integration

Uses the fact that once we know the probability density function, we can easily compute mean and variance via integration.

Note that in scikit-learn the method score_samples returns log pdf and therefore one needs to "exp" it.

Monte Carlo Simulation

The idea here is to simply sample from your KDE and estimate population mean and variance via sample mean and variance.

Code

import numpy as np
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity

N = 100
np.random.seed(1)
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]

X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]

kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

# Mean and Variance - Integration
pdf = lambda x : np.exp(kde.score_samples([[x]]))[0]
mean_integration = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]
variance_integration = quad(lambda x: (x ** 2) * pdf(x), a=-np.inf, b=np.inf)[0] - mean_integration ** 2

# Mean and Variance - Monte Carlo
n_samples = 10000000
samples = kde.sample(n_samples)

mean_mc = samples.mean()
variance_mc = samples.var()


print('Mean:\nIntegration: {}\nMonte Carlo: {}\n'.format(mean_integration, mean_mc))
print('Variance\nIntegration: {}\nMonte Carlo: {}\n'.format(variance_integration, variance_mc))

Output:

Mean: Integration: 3.560582852075697 Monte Carlo: 3.5595633705830934

Variance: Integration: 6.645066811078639 Monte Carlo: 6.646732489654485