Search code examples
pythonscipystatisticschi-squared

P-value from Chi sq test using Scipy


I am computing a test statistic that is distributed as a chi square with 1 degree of freedom. I am also computing P-value corresponding to this using two different techniques from scipy.stats.

I have observations and expected values as numpy arrays.

observation = np.array([  9.21899399e-04,   4.04363991e-01,   3.51713820e-02,
         3.00816946e-03,   1.80976731e-03,   6.46172153e-02,
         8.61549065e-05,   9.41395390e-03,   1.00946008e-03,
         1.25621846e-02,   1.06806251e-02,   6.66856795e-03,
         2.67380732e-01,   0.00000000e+00,   1.60859798e-02,
         3.63681803e-01,   1.06230978e-05])

expectation = np.array([ 0.07043956,  0.07043956,  0.07043956,  0.07043956,  0.07043956,
        0.07043956,  0.07043956,  0.07043956,  0.07043956,  0.07043956,
        0.07043956,  0.07043956,  0.07043956,  0.07043956,  0.07043956,
        0.07043956,  0.07043956])

For the first approach, I referred to this stackoverflow post. Following is what I am doing in the first approach:

from scipy import stats

chi_sq = np.sum(np.divide(np.square(observation - expectation), expectation)) 
p_value = 1 - stats.chi2.cdf(chi_sq, 1)

print(chi_sq, p_value)

>> (4.1029225303927959, 0.042809154353783851)

In the second approach, I am using chi-square method from spicy.stats. More specifically, I am using this link. This is how I am implementing the second method.

from scipy import stats
print( stats.chisquare(f_obs=observation, f_exp=expectation, ddof=0) )

>> Power_divergenceResult(statistic=4.1029225303927959, pvalue=0.99871467077385223)

I get the same value of chi square statistic in both the methods (i.e. statistic=4.1029225303927959), but different p-values. In the first approach, I get p_value=0.042809154353783851. In the second approach, I get pvalue=0.99871467077385223.

Why am I not getting the same p-values in both the approaches? Thanks.


Solution

  • For stats.chisquare, ddof is defined as

    ddofint, optional
    “Delta degrees of freedom”: adjustment to the degrees of freedom for the p-value. 
    The p-value is computed using a chi-squared distribution with 
    k - 1 - ddof degrees of freedom, 
    where k is the number of observed frequencies. The default value of ddof is 0.
    

    What you are doing is basically a Pearson chi-square test and the degree of freedom is k-1 , where n is the number of observations. From what I can see, your expectation is basically the mean of the observed, meaning you estimated 1 parameter so ddof is correct at 0. But for stats.chi2.cdf, df should be 16.

    So:

    chi_sq = np.sum(np.divide(np.square(observation - expectation), expectation)) 
    [1 - stats.chi2.cdf(chi_sq, len(observation)-1),
    stats.chisquare(f_obs=observation, ddof=0)[1]]
    
    [0.9987146707738522, 0.9987146706997099]
    

    A small difference but the scale is more or less correct..