I have a data frame for which I want to calculate a chi-squared statistic and p-value. However, when I print the expected values, they are not what I expect. The null hypothesis I expected the code to test is that Q7 does not depend on 'ConcernImprovement', so I expected the 'expected frequencies' for Decrease, Increase and No change to be the same for each Q7 entry.
This is my observed data frame, which is called LikelihoodConcern:
ConcernImprovement  Decrease  Increase  No change
Q7
Likely                   2.0      18.0       21.0
Not likely at all        0.0       2.0        1.0
Not very likely          3.0      11.0        5.0
Somewhat likely          4.0      24.0       14.0
Very likely              1.0      16.0        8.0
I tried this code:
from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(LikelihoodConcern, correction=False)
expected
It returns this for the expected frequencies:
array([[ 3.15384615, 22.39230769, 15.45384615],
[ 0.23076923, 1.63846154, 1.13076923],
[ 1.46153846, 10.37692308, 7.16153846],
[ 3.23076923, 22.93846154, 15.83076923],
[ 1.92307692, 13.65384615, 9.42307692]])
I expected it to return:
array([[ 13.67777777, 13.67777777, 13.67777777],
[ 1.00000000, 1.00000000, 1.00000000],
[ 6.33333333, 6.33333333, 6.33333333],
[ 14.00000000, 14.00000000, 14.00000000],
[ 8.33333333, 8.33333333, 8.33333333]])
I have looked at the source code for the expected_freq function, as the documentation doesn't have much detail, but I still don't understand why I am not seeing what I expect.
I gave it a test there, with the same input data as you had:
array([[ 2., 18., 21.],
[ 0., 2., 1.],
[ 3., 11., 5.],
[ 4., 24., 14.],
[ 1., 16., 8.]])
and got back the same expected frequencies that you did. Look at the first cell (row 'Likely', column 'Decrease'): the marginal sum for 'Likely' is 41, and for 'Decrease' it is 10. The grand total for the table is 130. Thus for the first cell we have an expected value of:
(10 * 41) / 130 = 3.1538461538461537
For the bottom right cell (row 'Very likely', column 'No change') we have:
(49 * 25) / 130 = 9.423076923076923
etc. These match up with the results from scipy.stats.
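To make the arithmetic explicit: under the independence hypothesis, each expected count is (row total * column total) / grand total. The columns are therefore weighted by their observed shares (10/130, 71/130 and 49/130) rather than split evenly three ways, which is why the rows of the expected table are not constant. Here is a minimal sketch that rebuilds the whole table by hand and compares it with scipy; the names `observed`, `row_totals`, `col_totals`, `grand_total` and `expected_manual` are my own, not anything from scipy.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts, same values as the LikelihoodConcern data frame
observed = np.array([
    [2., 18., 21.],   # Likely
    [0., 2., 1.],     # Not likely at all
    [3., 11., 5.],    # Not very likely
    [4., 24., 14.],   # Somewhat likely
    [1., 16., 8.],    # Very likely
])

row_totals = observed.sum(axis=1)   # one total per Q7 row
col_totals = observed.sum(axis=0)   # one total per ConcernImprovement column
grand_total = observed.sum()

# Expected count for cell (i, j) = row_totals[i] * col_totals[j] / grand_total
expected_manual = np.outer(row_totals, col_totals) / grand_total

chi2, p, dof, expected = chi2_contingency(observed, correction=False)

# The hand-built table should agree with what scipy returns
print(np.allclose(expected_manual, expected))
```

Note that each row of `expected_manual` still sums to that row's observed total; only the split across the three columns follows the overall column proportions.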