Tags: r, comparison, t-test

How to compare the number of cases in two same-sized groups?


This is most likely a very simple question, but I'll ask it anyway since I haven't found an answer. How can I compare the number of "cases" (for example, flu) in two groups, i.e. find out whether the difference between the case counts in the groups is statistically significant? Can I apply some sort of t-test? Or is it even meaningful to make this kind of comparison?

I'd preferably do the comparison in R.

A very simple data example:

group1 <- 1000 # size of group 1
group2 <- 1000 # size of group 2

group1_cases <- 550 # the number of cases in group 1
group2_cases <- 70 # the number of cases in group 2

Solution

  • I think a chisq.test is what you are looking for.

    group1 <- 1000 # size of group 1
    group2 <- 1000 # size of group 2
    
    group1_cases <- 550 # the number of cases in group 1
    group2_cases <- 70 # the number of cases in group 2
    
    # non-cases are derived from the group sizes rather than a hard-coded 1000
    group1_noncases <- group1 - group1_cases
    group2_noncases <- group2 - group2_cases
    
    
    M <- as.table(rbind(c(group1_cases, group1_noncases),
                        c(group2_cases, group2_noncases)))
    
    dimnames(M) <- list(groups = c("1", "2"),
                        cases = c("yes","no"))
    
    res <- chisq.test(M)
    
    # The null hypothesis, that the proportion of cases is the same in both groups, has to be rejected:
    
    res
    #> 
    #>  Pearson's Chi-squared test with Yates' continuity correction
    #> 
    #> data:  M
    #> X-squared = 536.33, df = 1, p-value < 2.2e-16
    
    # If both groups had the same proportion of cases, these would be the expected counts:
    
    res$expected
    #>       cases
    #> groups yes  no
    #>      1 310 690
    #>      2 310 690
    

    Created on 2021-04-28 by the reprex package (v0.3.0)
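
    The expected counts above follow from applying the pooled case rate to each group, which can be checked by hand. As a minimal sketch (not part of the original reprex), the same comparison can also be run with prop.test from base R, which works directly on case counts and group sizes and additionally reports a confidence interval for the difference in proportions. The variable names cases, sizes, pooled_rate, expected_cases and pt are chosen here for illustration only.

    ```r
    # Counts taken from the question.
    cases <- c(550, 70)      # cases in group 1 and group 2
    sizes <- c(1000, 1000)   # group sizes

    # Expected cases under the null hypothesis: the pooled case rate
    # applied to each group (620 / 2000 = 0.31, so 310 cases per group).
    pooled_rate    <- sum(cases) / sum(sizes)
    expected_cases <- pooled_rate * sizes

    # prop.test() tests whether the two proportions are equal. For two
    # groups it applies the same continuity correction by default, so the
    # X-squared statistic matches chisq.test() on the 2 x 2 table.
    pt <- prop.test(x = cases, n = sizes)

    pt$estimate   # observed proportions in each group
    pt$statistic  # X-squared statistic
    pt$conf.int   # 95% confidence interval for the difference in proportions
    ```

    Which of the two functions to use is mostly a matter of convenience: chisq.test takes the full table, while prop.test takes successes and totals.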

    Strictly speaking, a t.test is not the correct method here, because the outcome is binary (case or non-case) rather than continuous. However, people do use it for this kind of comparison, and in most cases the p-values are very similar.

    # t-test on the same data, expanded to one 0/1 value per person
    # (1 = case, 0 = non-case)
    dat <- data.frame(groups = c(rep("1", group1), rep("2", group2)),
           values = c(rep(1, group1_cases),
                      rep(0, group1_noncases),
                      rep(1, group2_cases),
                      rep(0, group2_noncases)))
    
    t.test(dat$values ~ dat$groups)
    
    #> 
    #>  Welch Two Sample t-test
    #> 
    #> data:  dat$values by dat$groups
    #> t = 27.135, df = 1490.5, p-value < 2.2e-16
    #> alternative hypothesis: true difference in means is not equal to 0
    #> 95 percent confidence interval:
    #>  0.4453013 0.5146987
    #> sample estimates:
    #> mean in group 1 mean in group 2 
    #>            0.55            0.07
    

    Created on 2021-04-28 by the reprex package (v0.3.0)
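
    To see how close the two p-values are for a given data set, both tests can be wrapped in a small helper. This is only a sketch: compare_tests is a hypothetical function written for this illustration (it is not part of the original answer or of any package), and the second call uses made-up counts purely to show a less extreme case. Note that chisq.test applies Yates' continuity correction to 2 x 2 tables by default, which is one reason the two p-values will not agree exactly.

    ```r
    # Hypothetical helper: returns the p-values of chisq.test() and the
    # Welch t.test() for the same two-group case counts.
    compare_tests <- function(cases1, n1, cases2, n2) {
      # 2 x 2 table of cases and non-cases for the chi-squared test
      tab <- rbind(c(cases1, n1 - cases1),
                   c(cases2, n2 - cases2))

      # The same data expanded to one 0/1 value per person for the t-test
      values <- c(rep(c(1, 0), times = c(cases1, n1 - cases1)),
                  rep(c(1, 0), times = c(cases2, n2 - cases2)))
      groups <- rep(c("1", "2"), times = c(n1, n2))

      c(chisq  = chisq.test(tab)$p.value,
        t_test = t.test(values ~ groups)$p.value)
    }

    # The data from the question
    p_question <- compare_tests(550, 1000, 70, 1000)
    p_question

    # Made-up counts (an assumption for illustration) with a smaller difference
    p_example <- compare_tests(60, 1000, 40, 1000)
    p_example
    ```

    With small groups or very rare cases the two tests can diverge more, so the chi-squared test (or prop.test) remains the safer default for count data.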