Search code examples
pythonscikit-learnstatsmodelskernel-density

How to return mean value (or expectation value) of a distribution estimated via function KernelDensity of sklearn, in python?


My question is, how to return the mean value and variance of the estimated "kde"? Or is there any other package you known that can easily output mean value or variance value, like print kde.mean() or print kde.get_parameter(mean)?

import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

N = 100
np.random.seed(1)
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]

X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)

Solution

  • In general, you need to do this numerically. I suggest 2 different approaches:

    • Integration
    • Monte Carlo Simulation

    These approaches work for any kernel and any bandwidth.

    Integration

    Uses the fact that once we know the probability density function, we can easily compute mean and variance via integration.

    mean and variance

    Note that in scikit-learn the method score_samples returns log pdf and therefore one needs to "exp" it.

    Monte Carlo Simulation

    The idea here is to simply sample from your KDE and estimate population mean and variance via sample mean and variance.


    Code

    import numpy as np
    from scipy.integrate import quad
    from sklearn.neighbors import KernelDensity
    
    N = 100
    np.random.seed(1)
    X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]
    
    X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]
    
    kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
    
    # Mean and Variance - Integration
    pdf = lambda x : np.exp(kde.score_samples([[x]]))[0]
    mean_integration = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]
    variance_integration = quad(lambda x: (x ** 2) * pdf(x), a=-np.inf, b=np.inf)[0] - mean_integration ** 2
    
    # Mean and Variance - Monte Carlo
    n_samples = 10000000
    samples = kde.sample(n_samples)
    
    mean_mc = samples.mean()
    variance_mc = samples.var()
    
    
    print('Mean:\nIntegration: {}\nMonte Carlo: {}\n'.format(mean_integration, mean_mc))
    print('Variance\nIntegration: {}\nMonte Carlo: {}\n'.format(variance_integration, variance_mc))
    

    Output:

    Mean: Integration: 3.560582852075697 Monte Carlo: 3.5595633705830934

    Variance: Integration: 6.645066811078639 Monte Carlo: 6.646732489654485