This is my code:
import numpy as np
from scipy.stats import gaussian_kde, norm
from numpy import linspace, hstack
from pylab import plot, show, hist
import re
import json
attribute_file="path"
attribute_values = [line.rstrip('\n') for line in open(attribute_file)]
obs = []
# Assume the list obs has been loaded here
obs = np.asarray(obs)
obs = np.sort(obs, kind='mergesort')
x_min = obs[0]
x_max = obs[-1]
# obtaining the pdf (my_pdf is a function!)
my_pdf = gaussian_kde(obs)
# plotting the result
x = linspace(x_min, x_max, 1000)
plot(x,my_pdf(x),'r') # distribution function
hist(obs, density=True, alpha=.3) # histogram
show()
new_values = np.asarray([-1, 0, 2, 3, 4, 5, 768])[:, np.newaxis]
for e in new_values:
    print(str(e) + " - " + str(my_pdf(e) * 100 * 2))
Problem: The obs array contains the list of all observations. I need to calculate a score (between 0 and 1) for new values
[-1, 0, 2, 3, 4, 500, 768]
For example, the value -1 should still get a decent score: it never appears in the observations, but it is next to the value 1, which is very common in them.
The reason for that is that you have many more 1's in your observations than 768's. So even if -1 is not exactly 1, it gets a high predicted value, because the histogram has a much larger value at 1 than at 768.
Up to a multiplicative constant, the formula for the prediction is:

f(x) = Σ_{x_i in D} K((x - x_i) / h)

where K is your kernel, D your observations, and h your bandwidth. Looking at the doc for gaussian_kde, we see that if no value is provided for bw_method, it is estimated in a way that doesn't suit you here.
So you can try some different values: the larger the bandwidth, the more points far from your new data are taken into account, the limit case being an almost constant predicted function.
On the other hand, a very small bandwidth only takes really close points into account, which is what I think you want.
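To see concretely what gaussian_kde computes, here is a small check of the formula above. This is only a sketch: it assumes scipy's convention that a scalar bw_method is a factor multiplied by the sample standard deviation, so the effective bandwidth is h = factor * std(data, ddof=1), and the toy data and the factor 0.5 are made up for the check.

import numpy as np
from scipy.stats import gaussian_kde, norm

toy_obs = np.array([1.0, 1.0, 2.0, 768.0])   # toy data, just for the check
factor = 0.5                                  # example bw_method value
kde = gaussian_kde(toy_obs, bw_method=factor)
h = factor * toy_obs.std(ddof=1)              # effective bandwidth of the kernel

x0 = -1.0
manual = norm.pdf((x0 - toy_obs) / h).sum() / (len(toy_obs) * h)  # (1/nh) * sum K((x - x_i)/h)
print(manual, kde(x0)[0])                     # the two numbers should match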
Some graphs to illustrate the influence of the bandwidth:
Code used:
import matplotlib.pyplot as plt

f, axarr = plt.subplots(2, 2, figsize=(10, 10))
for i, h in enumerate([0.01, 0.1, 1, 5]):
    my_pdf = gaussian_kde(obs, h)
    axarr[i // 2, i % 2].plot(x, my_pdf(x), 'r')             # distribution function
    axarr[i // 2, i % 2].set_title("Bandwidth: {0}".format(h))
    axarr[i // 2, i % 2].hist(obs, density=True, alpha=.3)   # histogram
plt.show()
With your current code, for x = -1, the value of K((x - x_i)/h) for all the x_i's equal to 1 is smaller than 1, but you add up a lot of these values (there are 921 1s in your observations, and also 357 2s).
On the other hand, for x = 768, the kernel takes its maximal value K(0) for all the x_i's equal to 768, but there are not many such points (39 to be precise). So here a lot of "small" terms make a larger sum than a small number of larger terms.
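As a rough back-of-the-envelope check of this effect, here is a sketch using only the counts quoted above (921 ones, 357 twos, 39 values of 768), a hypothetical bandwidth h = 1, and ignoring all the other observations and the constant factors:

from scipy.stats import norm

h = 1.0  # hypothetical bandwidth, for illustration only
at_minus_1 = 921 * norm.pdf((-1 - 1) / h) + 357 * norm.pdf((-1 - 2) / h)
at_768 = 39 * norm.pdf((768 - 768) / h)
print(at_minus_1, at_768)   # roughly 51 vs. 16: many small terms beat a few large ones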
If you don't want this behavior, you can decrease the size of your Gaussian kernel: this way the penalty (K(-2)) paid because of the distance between -1 and 1 will be higher. But I think that this would be overfitting your observations.
Deciding whether a new sample is acceptable (compared to your empirical distribution) is more of a statistics question; you could have a look at stats.stackexchange.com.
You can always try to use a low value for the bandwidth, which will give you a peaked predicted function. Then you can normalize this function, dividing it by its maximal value.
After that, all predicted values will be between 0 and 1:
maxDensityValue = np.max(my_pdf(x))
for e in new_values:
    print("{0} {1}".format(e, my_pdf(e) / maxDensityValue))
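If it helps, here is a minimal sketch that wraps the idea above (a small bandwidth plus division by the maximal density) into a helper. The bandwidth factor 0.1 and the grid of 1000 points are just example choices, not values derived from your data:

import numpy as np
from scipy.stats import gaussian_kde

def normalized_scores(observations, values, bw=0.1):
    kde = gaussian_kde(observations, bw_method=bw)
    grid = np.linspace(np.min(observations), np.max(observations), 1000)
    max_density = np.max(kde(grid))      # peak of the predicted function on the grid
    return kde(values) / max_density     # scores roughly in [0, 1]

Note that the maximum is only taken over the grid, so a score can slightly exceed 1 if the true peak falls between two grid points.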