Tags: python, statistics, logistic-regression, glm, anova

Compare proportions and Statistical significance between more than 2 groups


I created a dataframe to compare the proportions of three or more groups (the real data has more than 50,000 rows). In the left column (Survive), 0 indicates survival and 1 indicates death; the right column (grade) holds the grade, coded 0, 1, 2, or 3.

In the example dataframe, the proportions by grade do not seem to differ much, but I want a p-value that tells me whether the difference is statistically significant.

The survival rates obtained from the examples are as follows.

grade 0 57.14%

grade 1 66.67%

grade 2 50.0%

grade 3 60.0%

I tried both the chi-square test and ANOVA. However, I don't know which method is correct.

import pandas as pd

ex_df = pd.DataFrame({"Survive":[0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1],
              "grade":[2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2]})

I want to calculate the p-value

p-value : 0.xxxx ....


Solution

  • This problem is best analyzed with logistic regression. (The proposed method of using a binomial test of each grade group against the entire sample is statistically incorrect.) Using R (since I'm not a Pythonista), it's fairly easily demonstrated, and I suspect there is a Python analog. The R results can be used to check the correctness of any Python implementation. As you can see, the dataframe structure in Pandas was copied from R, as were many of its statistical routines:

    ex_df = data.frame(Survive = c(0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1),
                   grade=factor(c(2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2)) )
    glm(Survive~grade, data=ex_df, family="binomial")
    #--- output---
    Call:  glm(formula = Survive ~ grade, family = "binomial", data = ex_df)
    
    Coefficients:
    (Intercept)       grade1       grade2       grade3  
        -0.2877      -0.4055       0.2877      -0.1178  
    
    Degrees of Freedom: 23 Total (i.e. Null);  20 Residual
    Null Deviance:      32.6 
    Residual Deviance: 32.25    AIC: 40.25
    #----------
    
    summary( glm(Survive~grade, data=ex_df, family="binomial") )
    #-------output------
    Call:
    glm(formula = Survive ~ grade, family = "binomial", data = ex_df)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.1774  -1.0579  -0.9005   1.3018   1.4823  
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -0.2877     0.7638  -0.377    0.706
    grade1       -0.4055     1.1547  -0.351    0.725
    grade2        0.2877     1.1180   0.257    0.797
    grade3       -0.1178     1.1902  -0.099    0.921
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 32.601  on 23  degrees of freedom
    Residual deviance: 32.247  on 20  degrees of freedom
    AIC: 40.247
    
    Number of Fisher Scoring iterations: 4
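
The R summary reports a Wald p-value for each grade against grade 0, but not the single overall p-value the question asks for. One common way to get it is a likelihood-ratio test: the drop from the null deviance to the residual deviance is compared with a chi-square distribution whose degrees of freedom equal the number of grades minus one. The following is a minimal Python sketch of that idea, not the answerer's code. It assumes grade is treated as a categorical factor (as in the R call), in which case the fitted probabilities are just the per-grade proportions and no iterative fitting is needed. The helper names `binomial_deviance` and `chi2_sf` are hypothetical and written here only for illustration, using the standard library alone.

```python
import math

# Data copied from the question: 1 = death, 0 = survival.
survive = [0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1]
grade = [2,1,2,3,0,0,0,3,0,2,1,1,0,0,0,1,2,3,1,2,1,3,3,2]


def binomial_deviance(events, total):
    """-2 * maximised Bernoulli log-likelihood for `events` ones in `total` trials."""
    dev = 0.0
    for k in (events, total - events):
        if k > 0:  # a zero count contributes nothing (0 * log 0 taken as 0)
            dev -= 2.0 * k * math.log(k / total)
    return dev


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting values: df = 1 for odd df, df = 2 for even df.
    a, q = (0.5, math.erfc(math.sqrt(z))) if df % 2 else (1.0, math.exp(-z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while 2 * a < df:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Tally deaths and group sizes per grade.
deaths, sizes = {}, {}
for s, g in zip(survive, grade):
    deaths[g] = deaths.get(g, 0) + s
    sizes[g] = sizes.get(g, 0) + 1

# Null model: one common death probability. Grade model: one per grade.
null_deviance = binomial_deviance(sum(survive), len(survive))
residual_deviance = sum(binomial_deviance(deaths[g], sizes[g]) for g in sizes)

lr_stat = null_deviance - residual_deviance
df = len(sizes) - 1
p_value = chi2_sf(lr_stat, df)

print(f"null deviance     = {null_deviance:.3f}")
print(f"residual deviance = {residual_deviance:.3f}")
print(f"LR statistic = {lr_stat:.3f} on {df} df, p-value = {p_value:.4f}")
```

The two deviances should reproduce the 32.601 and 32.247 in the R output above, which gives a statistic of about 0.354 on 3 degrees of freedom and a p-value of roughly 0.95, so the example data show no evidence that the death rate differs by grade. On the real 50,000-row data, a library fit is probably more convenient; for instance, statsmodels' formula interface (`smf.logit("Survive ~ C(grade)", data=ex_df).fit()`) exposes the same likelihood-ratio test through its `llr_pvalue` attribute, assuming statsmodels is installed.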