python statistics logistic-regression glm anova

Compare proportions and test statistical significance between more than 2 groups

I created a dataframe to compare the proportions of three or more groups (the real data has more than 50,000 rows). In the left column, 0 indicates survival and 1 indicates death; the values 0, 1, 2, 3 in the right column indicate the grade.

In the example dataframe, the proportions by grade do not seem to differ much, but I want a p-value that tells me whether the difference is statistically significant.

The survival rates obtained from the example are as follows.

grade 0 57.14%

grade 1 66.66%

grade 2 50.0%

grade 3 60.0%
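The rates above can be reproduced from the example data with a groupby. This is only a minimal sketch: the names `survival_rate` and `table` are illustrative, and it relies on the coding stated earlier (0 = survival, 1 = death), so the survival rate is one minus the mean of `Survive`.

```python
import pandas as pd

# Same values as the example dataframe in this question.
ex_df = pd.DataFrame({
    "Survive": [0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1],
    "grade":   [2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2],
})

# Survive == 1 means death, so mean(Survive) is the death rate per grade
# and 1 - mean(Survive) is the survival rate per grade.
survival_rate = 1 - ex_df.groupby("grade")["Survive"].mean()
print((survival_rate * 100).round(2))

# Counts per grade (rows) and outcome (columns), useful as input to a test.
table = pd.crosstab(ex_df["grade"], ex_df["Survive"])
print(table)
```

The contingency table is the natural starting point for any test of whether the proportions differ across grades.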

I tried both the chi-square test and ANOVA, but I don't know which method is correct.

import pandas as pd

ex_df = pd.DataFrame({"Survive":[0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1],
                      "grade":[2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2]})

I want to calculate the p-value

p-value : 0.xxxx ....

Solution

This problem is best analyzed with logistic regression. (The proposed method of using a binomial test of grade groups against the entire sample is statistically incorrect.) Using R (since I'm not a Pythonista), it's fairly easy to demonstrate, and I suspect that there is a Python analog. The R results can be used to check the correctness of any Python implementation. As you can see, the dataframe structure in pandas was copied from R, as were many of its statistical routines:

ex_df = data.frame(Survive = c(0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1),
               grade=factor(c(2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2)) )
 glm(Survive~grade, data=ex_df, family="binomial")
#--- output---
Call:  glm(formula = Survive ~ grade, family = "binomial", data = ex_df)

Coefficients:
(Intercept)       grade1       grade2       grade3  
    -0.2877      -0.4055       0.2877      -0.1178  

Degrees of Freedom: 23 Total (i.e. Null);  20 Residual
Null Deviance:      32.6 
Residual Deviance: 32.25    AIC: 40.25
#----------

summary( glm(Survive~grade, data=ex_df, family="binomial") )
#-------output------
Call:
glm(formula = Survive ~ grade, family = "binomial", data = ex_df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1774  -1.0579  -0.9005   1.3018   1.4823  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.2877     0.7638  -0.377    0.706
grade1       -0.4055     1.1547  -0.351    0.725
grade2        0.2877     1.1180   0.257    0.797
grade3       -0.1178     1.1902  -0.099    0.921

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 32.601  on 23  degrees of freedom
Residual deviance: 32.247  on 20  degrees of freedom
AIC: 40.247

Number of Fisher Scoring iterations: 4
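
For a Python analog, one option is the formula interface in statsmodels. This is a hedged sketch rather than a definitive implementation: it assumes statsmodels and scipy are installed, and the variable names are illustrative. `C(grade)` plays the role of `factor()` in R, so the coefficients should match the R output above. The single p-value for "does grade matter at all" comes from the likelihood-ratio test against the intercept-only model, which is the difference between the null and residual deviance in the R output (32.601 - 32.247 = 0.354 on 3 degrees of freedom, a p-value of roughly 0.95). The chi-square test of independence mentioned in the question addresses the same null hypothesis and is included as a cross-check.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

ex_df = pd.DataFrame({
    "Survive": [0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1],
    "grade":   [2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2],
})

# Logistic regression with grade as a categorical predictor (grade 0 is the baseline).
model = smf.logit("Survive ~ C(grade)", data=ex_df).fit(disp=False)
print(model.summary())

# Likelihood-ratio test of the fitted model against the intercept-only model.
# This is the overall test of whether the proportions differ by grade.
lr_stat = model.llr
lr_pvalue = model.llr_pvalue
print(f"LR chi-square = {lr_stat:.4f}, df = {int(model.df_model)}, p-value = {lr_pvalue:.4f}")

# Cross-check: Pearson chi-square test of independence on the grade x outcome table.
table = pd.crosstab(ex_df["grade"], ex_df["Survive"])
chi2_stat, chi2_pvalue, dof, expected = chi2_contingency(table)
print(f"Pearson chi-square = {chi2_stat:.4f}, df = {dof}, p-value = {chi2_pvalue:.4f}")
```

With cell counts as small as in this toy example, both chi-square approximations are rough; with 50,000 rows that concern largely disappears. The per-coefficient p-values in the summary compare each grade only with grade 0, so they do not answer the overall question on their own.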