Search code examples
rcategorical-datacrosstab

How do I crosstab observations with membership in multiple categories?


I have a data set with observations with a mix of mutually exclusive and non-mutually exclusive categories. For example, let's assume that mixed ethnicities do not exist and multiple citizenships do, so the data set looks something like:

 id white hispanic asian usa canada uk
 1     0        1     0   1      0  1
 2     1        0     0   0      1  0
 3     0        0     1   1      0  1
 4     1        0     0   1      1  0
 5     0        1     0   0      0  1
 6     0        0     1   0      0  1

As you can see, any one person/observation has only one ethnicity but can have multiple citizenships. I wish to break down ethnicity by citizenship and produce something like this:

         usa       canada    uk        total
white     1 (33%)   2 (66%)   0         3  
hispanic  1 (33%)   0         2 (66%)   3  
asian     1 (33%)   0         2 (66%)   3  
total     3         2         3      

How do I write a loop that would sum up across the categories so that I can do a crosstab between ethnicity and citizenship (double counting is okay)?

Any advice/suggestion on visualization for such data would be very much appreciated. Thanks muchly for your help!


Solution

  • Based on what I understand, you can mutate your data to a tidy format, then use janitor to get a cross-table:

    Data:

    df <- data.frame(id = seq(1,6),
                     white = c(0,1,0,1,0,0),
                     hispanic = c(1,0,0,0,1,0),
                     asian = c(0,0,1,0,0,1),
                     usa = c(1,0,1,1,0,0),
                     canada = c(0,1,0,1,0,0),
                     uk = c(1,0,1,0,1,1)) 
    

    Code:

    library(tidyverse)
    library(janitor)
    
    df %>% 
      pivot_longer(cols = 2:4,names_to = "Origin") %>% 
      filter(value == 1) %>% 
      select(-value) %>% 
      pivot_longer(cols = 2:4, names_to = "ethnicity") %>% 
      filter(value == 1) %>% 
      select(-value) %>% 
      tabyl(Origin, ethnicity) %>% 
      adorn_totals(where = c("row","col")) %>% 
      adorn_percentages(denominator = "col") %>% 
      adorn_pct_formatting(digits = 0) %>% 
      adorn_ns(position = "front")
    

    Output:

       Origin   canada       uk      usa    Total
        asian 0   (0%) 2  (50%) 1  (33%) 3  (33%)
     hispanic 0   (0%) 2  (50%) 1  (33%) 3  (33%)
        white 2 (100%) 0   (0%) 1  (33%) 3  (33%)
        Total 2 (100%) 4 (100%) 3 (100%) 9 (100%)