Tags: r, chi-squared, population

Finding differences between populations


I have equivalent data from 2019 and 2020. The proportions of diagnoses in 2020 look like they differ from 2019, but I'd like to ...

a) statistically test whether the populations are different.
b) determine which categories differ the most.

I've worked out I can do 'a' using:

chisq.test(diagnosis$count.2020, diagnosis$count.2019)

I don't know how to find out which categories are the ones that are the most different between 2020 and 2019. Any help would be amazing, thanks!

diagnosis <- data.frame(mf_label = c("Audiovestibular", "Autonomic", "Cardiovascular", 
                           "Cerebral palsy", "Cerebrovascular", "COVID", "Cranial nerves", 
                           "CSF disorders", "Developmental", "Epilepsy and consciousness", 
                           "Functional", "Head injury", "Headache", "Hearing loss", "Infection", 
                           "Maxillofacial", "Movement disorders", "Muscle and NMJ", "Musculoskeletal", 
                           "Myelopathy", "Neurodegenerative", "Neuroinflammatory", "Peripheral nerve", 
                           "Plexopathy", "Psychiatric", "Radiculopathy", "Spinal", "Syncope", 
                           "Toxic and nutritional", "Tumour", "Visual system"),
              count.2019 = c(5, 0, 1, 1, 2, 0, 4, 3, 0, 7, 4, 0, 24, 0, 0, 2, 22, 3, 3, 0, 3, 18, 12, 0, 0, 2, 2, 0, 1, 4, 0),
              count.2020 = c(5, 1, 1, 3, 28, 9, 11, 13, 1, 13, 30, 5, 68, 1, 1, 2, 57, 14, 5, 8, 16, 37, 27, 3, 13, 17, 3, 1, 8, 13, 11))

Solution

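  • Your chi-squared test is not correct. When chisq.test is given two separate vectors, it coerces them to factors and cross-tabulates them, so it does not compare the counts themselves. You need to provide the counts as a table or matrix instead. Because the expected values are very small for half of the cells, the chi-squared approximation is unreliable, so you need to use simulation to estimate the p-value: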

    results <- chisq.test(diagnosis[, 2:3], simulate.p.value=TRUE)
    

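    The overall table is barely significant at .05 (the simulated p-value varies slightly from run to run). The chisq.test function returns a list including the observed counts (results$observed), the expected values (results$expected), the Pearson residuals (results$residuals), and the standardized residuals (results$stdres). The categories with the largest absolute standardized residuals are the ones that differ most between the two years. The manual page (?chisq.test) describes these components and provides some citations for more details.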
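
One way to use the standardized residuals for part 'b', as a minimal sketch rather than a definitive analysis: the block below rebuilds the question's counts so that it runs on its own, attaches the category labels as row names, and ranks the categories by absolute standardized residual. The names counts, stdres and ranked, the seed, and B = 10000 are illustrative choices and not part of the original answer.

```r
# Counts copied from the question so the block is self-contained.
diagnosis <- data.frame(
  mf_label = c("Audiovestibular", "Autonomic", "Cardiovascular",
               "Cerebral palsy", "Cerebrovascular", "COVID", "Cranial nerves",
               "CSF disorders", "Developmental", "Epilepsy and consciousness",
               "Functional", "Head injury", "Headache", "Hearing loss", "Infection",
               "Maxillofacial", "Movement disorders", "Muscle and NMJ", "Musculoskeletal",
               "Myelopathy", "Neurodegenerative", "Neuroinflammatory", "Peripheral nerve",
               "Plexopathy", "Psychiatric", "Radiculopathy", "Spinal", "Syncope",
               "Toxic and nutritional", "Tumour", "Visual system"),
  count.2019 = c(5, 0, 1, 1, 2, 0, 4, 3, 0, 7, 4, 0, 24, 0, 0, 2, 22, 3, 3, 0,
                 3, 18, 12, 0, 0, 2, 2, 0, 1, 4, 0),
  count.2020 = c(5, 1, 1, 3, 28, 9, 11, 13, 1, 13, 30, 5, 68, 1, 1, 2, 57, 14,
                 5, 8, 16, 37, 27, 3, 13, 17, 3, 1, 8, 13, 11)
)

# Build a categories-by-years matrix and label the rows,
# so the residuals come back with diagnosis names attached.
counts <- as.matrix(diagnosis[, c("count.2019", "count.2020")])
rownames(counts) <- diagnosis$mf_label

# The simulated p-value depends on the random draws; the seed is an
# arbitrary choice made only so that repeated runs agree.
set.seed(1)
results <- chisq.test(counts, simulate.p.value = TRUE, B = 10000)

# Standardized residuals: one row per category, one column per year.
stdres <- results$stdres

# Rank categories by the size of the 2020 residual, largest first.
ranked <- stdres[order(abs(stdres[, "count.2020"]), decreasing = TRUE), ]

# Show the categories that deviate most from the expected counts.
print(round(head(ranked, 10), 2))
```

Because the table has only two columns, the two columns of standardized residuals are mirror images of each other, so ranking on either year gives the same order. A positive value in the count.2020 column means the category made up a larger share of diagnoses in 2020 than expected if both years had the same distribution. As a rough guide, absolute values above about 2 are often treated as notable, though with this many categories an adjustment for multiple comparisons may be worth considering.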