python pandas scipy contingency

Why aren't the expected frequencies returned by scipy.stats.contingency.expected_freq what I expect?


I have a data frame for which I want to calculate a chi-squared statistic and p-value. However, when I print the expected values, they are not what I expect. The null hypothesis I expected the code to test is that Q7 does not depend on 'ConcernImprovement', so I expected the 'expected frequencies' for Decrease, Increase and No change to be the same within each Q7 row.

This is my observed data frame which is called LikelihoodConcern:

ConcernImprovement  Decrease  Increase  No change
Q7                                               
Likely                   2.0      18.0       21.0
Not likely at all        0.0       2.0        1.0
Not very likely          3.0      11.0        5.0
Somewhat likely          4.0      24.0       14.0
Very likely              1.0      16.0        8.0

I tried this code:

from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(LikelihoodConcern, correction=False)
expected

It returns this for the expected frequencies:

array([[ 3.15384615, 22.39230769, 15.45384615],
       [ 0.23076923,  1.63846154,  1.13076923],
       [ 1.46153846, 10.37692308,  7.16153846],
       [ 3.23076923, 22.93846154, 15.83076923],
       [ 1.92307692, 13.65384615,  9.42307692]])

I expected it to return:

array([[13.66666667, 13.66666667, 13.66666667],
       [ 1.00000000,  1.00000000,  1.00000000],
       [ 6.33333333,  6.33333333,  6.33333333],
       [14.00000000, 14.00000000, 14.00000000],
       [ 8.33333333,  8.33333333,  8.33333333]])

I have looked at the source code for the expected_freq function, as the documentation doesn't have much detail, but I still don't understand why I am not seeing what I expect.


Solution

  • I ran a test with the same input data as you had:

    array([[ 2., 18., 21.],
           [ 0.,  2.,  1.],
           [ 3., 11.,  5.],
           [ 4., 24., 14.],
           [ 1., 16.,  8.]])
    

    and got back the same expected frequencies that you did. Take the first cell (row 'Likely', column 'Decrease'): the marginal sum for 'Likely' is 41, and for 'Decrease' it is 10. The grand total for the table is 130. Thus for the first cell we have an expected value of:

    (10 * 41) / 130 = 3.1538461538461537
    

    For the bottom-right cell (row 'Very likely', column 'No change') we have:

    (49 * 25) / 130 = 9.423076923076923
    

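    and so on for the other cells. These match the results from scipy.stats. The test's null hypothesis is that rows and columns are independent given the observed marginal totals, so each expected count is (row total * column total) / grand total; it does not assume that the three columns are equally frequent within a row.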
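
As a rough check of the arithmetic above, the sketch below rebuilds the whole expected table from the marginal totals and compares it with scipy.stats.contingency.expected_freq. The observed counts are copied from the question; the variable names (observed, row_totals, col_totals, manual_expected, scipy_expected) are illustrative and not part of the original post.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts from the question.
# Rows: Likely, Not likely at all, Not very likely, Somewhat likely, Very likely
# Columns: Decrease, Increase, No change
observed = np.array([
    [2.0, 18.0, 21.0],
    [0.0, 2.0, 1.0],
    [3.0, 11.0, 5.0],
    [4.0, 24.0, 14.0],
    [1.0, 16.0, 8.0],
])

row_totals = observed.sum(axis=1)  # 41, 3, 19, 42, 25
col_totals = observed.sum(axis=0)  # 10, 71, 49
grand_total = observed.sum()       # 130

# Expected count under independence:
# (row total * column total) / grand total, for every cell at once.
manual_expected = np.outer(row_totals, col_totals) / grand_total

# The same table as computed by SciPy.
scipy_expected = expected_freq(observed)

print(manual_expected)
print(np.allclose(manual_expected, scipy_expected))
```

Because every expected value is scaled by its column total, the rarely chosen 'Decrease' column gets small expected counts in every row, which is why the rows are not flat.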