I created a dataframe to compare proportions across three or more groups (the real data has more than 50,000 rows). In the left column, 0 indicates survival and 1 indicates death; the values 0, 1, 2, 3 in the right column indicate the grade.
In the example dataframe, the proportions do not seem to differ by grade, but I want a p-value that tells me whether the difference is statistically significant.
The survival rates obtained from the examples are as follows.
grade 0 57.14%
grade 1 66.67%
grade 2 50.0%
grade 3 60.0%
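The rates listed above can be reproduced from the example data with a groupby. This is only a minimal sketch: because 1 codes death, the mean of `Survive` within a grade is the death rate, and one minus that is the survival rate. The name `survival_rate` is illustrative and not part of the original code.

```python
import pandas as pd

ex_df = pd.DataFrame({
    "Survive": [0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1],
    "grade":   [2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2],
})

# Survive == 1 means death, so the group mean is the death rate
death_rate = ex_df.groupby("grade")["Survive"].mean()
survival_rate = 1 - death_rate

# Counts of survivors (0) and deaths (1) per grade, for reference
counts = pd.crosstab(ex_df["grade"], ex_df["Survive"])

print((survival_rate * 100).round(2))
print(counts)
```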
I tried both the chi-square test and ANOVA, but I don't know which method is correct.
import pandas as pd

ex_df = pd.DataFrame({"Survive": [0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1],
                      "grade": [2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2]})
I want to calculate the p-value
p-value : 0.xxxx ....
This problem is best analyzed with logistic regression. (The proposed method of using a binomial test of each grade group against the entire sample is statistically incorrect.) Using R (since I'm not a Pythonista) it's fairly easily demonstrated, and I suspect that there is a Python analog. The R results can be used to check the correctness of any Python implementation. As you can see, the dataframe structure in pandas was copied from R, as were many of its statistical routines:
ex_df = data.frame(Survive = c(0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1),
grade=factor(c(2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2)) )
glm(Survive~grade, data=ex_df, family="binomial")
#--- output---
Call: glm(formula = Survive ~ grade, family = "binomial", data = ex_df)
Coefficients:
(Intercept)       grade1       grade2       grade3
    -0.2877      -0.4055       0.2877      -0.1178
Degrees of Freedom: 23 Total (i.e. Null); 20 Residual
Null Deviance: 32.6
Residual Deviance: 32.25 AIC: 40.25
#----------
summary( glm(Survive~grade, data=ex_df, family="binomial") )
#-------output------
Call:
glm(formula = Survive ~ grade, family = "binomial", data = ex_df)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.1774  -1.0579  -0.9005   1.3018   1.4823
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.2877     0.7638  -0.377    0.706
grade1       -0.4055     1.1547  -0.351    0.725
grade2        0.2877     1.1180   0.257    0.797
grade3       -0.1178     1.1902  -0.099    0.921
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 32.601 on 23 degrees of freedom
Residual deviance: 32.247 on 20 degrees of freedom
AIC: 40.247
Number of Fisher Scoring iterations: 4
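The per-coefficient p-values above each compare one grade against grade 0. A single p-value for "does grade matter at all" comes from the likelihood-ratio test, whose statistic is the drop in deviance: 32.601 - 32.247 = 0.354 on 3 degrees of freedom. A Python analog might look like the sketch below. It assumes `statsmodels` is installed and uses its formula interface, where `C(grade)` plays the role of `factor()` in R. It is an illustration to be checked against the R output, not code from the original answer.

```python
import pandas as pd
import statsmodels.formula.api as smf

ex_df = pd.DataFrame({
    "Survive": [0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1],
    "grade":   [2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2],
})

# Logistic regression of death (Survive == 1) on grade as a categorical predictor
result = smf.logit("Survive ~ C(grade)", data=ex_df).fit(disp=False)

# Coefficients should match the R estimates (intercept about -0.2877)
print(result.params)

# Likelihood-ratio test of the fitted model against the intercept-only model
print(f"LR chi-square: {result.llr:.3f} on {int(result.df_model)} df")
print(f"p-value: {result.llr_pvalue:.4f}")
```

With data this small the test has very little power, so a large p-value here says little about what the 50,000-row dataset will show.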