Search code examples
pythonscipyprobability

scipy.stats cdf greater than 1


I'm using scipy.stats and I need the CDF up to a given value x for some distributions, I know PDFs can be greater than 1 because they are not probabilities but densities so they should integrate to 1 even if specific values are greater, but CDFs should never be greater than 1 and when running the cdf function on scipy.stats sometimes I get values like 2.89, i'm completely sure i'm using cdf and not pdf(that was my first guess), this is messing my results and algorithm because I need accumulated probabilities, why is scipy.stats cdf returning values greater than 1 and/or how should I proceed to fix it?

Code for reproducing the issue with a sample distribution and parameters(but it happens with others too):

from scipy import stats
distribution = stats.gausshyper
params = [9.482986347673158, 16.65813644507513, -38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
test_val = [-0.512720,1,1]

arg = params[:-2]
loc = params[-2]
scale = params[-1]

print("cdf:",distribution.cdf(test_val,*arg, loc=loc,scale=scale))
print("pdf:",distribution.pdf(test_val,*arg, loc=loc,scale=scale))

cdf: [2.68047481 7.2027761 7.2027761 ] pdf: [2.76857133 2.23996739 2.23996739]


Solution

  • The problem lies in the parameters that you have specified for the Gaussian hypergeometric (HG) distribution, specifically in the third element of params, which is the parameter beta in the HG distribution (see equation 2 in this paper for the definiton of the density of the Gauss Hypergeometric distr.). This parameter has to be positive for HG to have a valid density. Otherwise, the density won't integrate to 1, which is exactly what is happening in your example. With a negative beta, the distribution is not a valid probability distribution.

    You can also find the requirement that beta (denoted as b) has to be positive in the scipy documentation here. Changing beta to a positive parameter immediately solves your problem:

    from scipy import stats
    distribution = stats.gausshyper
    params = [9.482986347673158, 16.65813644507513, 38.11083665959626, 16.08698932118982, -13.387170754433273, 18.352117022674125]
    test_val = [-0.512720,1,1]
    
    arg = params[:-2]
    loc = params[-2]
    scale = params[-1]
    
    print("cdf:",distribution.cdf(test_val,*arg, loc=loc,scale=scale))
    print("pdf:",distribution.pdf(test_val,*arg, loc=loc,scale=scale))
    

    Output:

    cdf: [1. 1. 1.]
    pdf: [3.83898392e-32 1.25685346e-35 1.25685346e-35]
    

    ,where all cdfs integrate to 1 as desired. Also note that your x also has to be between 0 and 1, as described in the scipy documentation here.