python scikit-learn probability kernel-density

Python--calculate normalized probability of a value given a list of samples


So, as the title says, I'm trying to calculate the probability of a value given a list of samples, preferably normalized so the probability is 0<p<1. I found this answer on the topic from about 6 years ago, which seemed promising. To test it, I implemented the example used in the first reply (edited for brevity):

import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad


# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
                       10 + np.random.randn(30, 1)))

x = np.linspace(0, 16, 1000)[:, np.newaxis]

# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)

# Get probability for range of values
start = 5  # Start of the range
end = 6    # End of the range

probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]

However, this approach throws the following error:

Traceback (most recent call last):
  File "prob test.py", line 44, in <module>
    probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
  File "/usr/lib/python3/dist-packages/scipy/integrate/quadpack.py", line 340, in quad
    retval = _quad(func, a, b, args, full_output, epsabs, epsrel, limit,
  File "/usr/lib/python3/dist-packages/scipy/integrate/quadpack.py", line 448, in _quad
    return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
  File "prob test.py", line 44, in <lambda>
    probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
  File "/usr/lib/python3/dist-packages/sklearn/neighbors/_kde.py", line 190, in score_samples
    X = check_array(X, order='C', dtype=DTYPE)
  File "/usr/lib/python3/dist-packages/sklearn/utils/validation.py", line 545, in check_array
    raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=5.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I'm not sure how to reshape the input when it's already inside the lambda function, and in any case I'm guessing this happens because scikit-learn has been updated in the 6 years since that answer was written. What's the best way to work around this issue to get the probability value?

Thanks!


Solution

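  • As the scikit-learn documentation states:

    score_samples(X): X array-like of shape (n_samples, n_features)

    quad calls your function with a scalar x, so you have to wrap that scalar in a 2D array before passing it to score_samples. score_samples returns a 1-element array, so take element [0] of it. quad itself returns a (value, error estimate) tuple, so keep the trailing [0] from your original code as well:

    probability = quad(lambda x: np.exp(kd.score_samples(np.array([[x]])))[0], start, end)[0]

    or:

    probability = quad(lambda x: np.exp(kd.score_samples(np.array([x]).reshape(-1, 1)))[0], start, end)[0]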
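
For reference, here is the whole thing as a minimal, self-contained sketch. The helper names density and interval_probability and the fixed random seed are my own additions for illustration and are not part of scikit-learn or of the question's code. The bandwidth and the sample mixture are taken from the question.

```python
import numpy as np
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity

# Fixed seed so the run is reproducible (an assumption, not in the question)
rng = np.random.default_rng(0)

# Same mixture of 2 Gaussians as in the question, with modes at 5 and 10
data = np.concatenate((5 + rng.standard_normal((10, 1)),
                       10 + rng.standard_normal((30, 1))))

kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)


def density(x):
    """Estimated density at a scalar x, returned as a plain float."""
    # score_samples wants shape (n_samples, n_features) and returns log-density
    log_density = kd.score_samples(np.array([[x]]))
    return float(np.exp(log_density[0]))


def interval_probability(start, end):
    """Integrate the estimated density over [start, end]."""
    value, _abs_error = quad(density, start, end)
    return value


probability = interval_probability(5, 6)
print(probability)
```

Because a density integrates to 1 over its whole support, the value for any finite interval lands between 0 and 1, which is the normalization asked for in the question. Integrating over a range that comfortably covers all the samples, such as interval_probability(-5, 20) here, should give a value very close to 1 and makes a quick sanity check.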