Search code examples
rcountdplyrfrequency

Counting number of occurrences in multiple factor variables


I have several columns with information on household gender composition (10 variables total). I want to count the number of males in each household.

Head of the dataset:

   gndr  gndr2  gndr3  gndr4  gndr5  gndr6  gndr7  gndr8  gndr9 gndr10 
1   Male Female   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
2  Female  Male Female   Male   Male   Male   Male   <NA>   <NA>   <NA>   
3  Female  Male Female   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
4   Male   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   
5   Male Female   Male Female Female   Male   Male   <NA>   <NA>   <NA>   

I want to create a table that has the number of households with no male, with one male, with two males, and so on.

Is there any code in the dplyr and tidyr packages that can do that?


Solution

  • This is a pretty standard use case where data was collected in "wide" form and is best worked with in "long" form. The difference is that in long form you only have one column for gender and another for which individual that gender belongs to. We'll use tidyr::gather to wrangle it into long form and the use dplyr to summarize the number of households with 1, 2, 3, ... men.

    library(dplyr)
    library(tidyr)
    
    wide.df <- tribble(
      ~gndr,  ~gndr2,  ~gndr3,  ~gndr4,  ~gndr5,  ~gndr6,  ~gndr7,  ~gndr8,  ~gndr9, ~gndr10,
      "Male", "Female", NA, NA, NA, NA, NA, NA, NA, NA,   
      "Female", "Male", "Female", "Male", "Male", "Male", "Male", NA, NA, NA,   
      "Female", "Male", "Female", NA, NA, NA, NA, NA, NA, NA,   
      "Male", NA, NA, NA, NA, NA, NA, NA, NA, NA,   
      "Male", "Female", "Male", "Female", "Female", "Male", "Male", NA, NA, NA 
    )
    
    wide.df %>%
      mutate(household = 1:nrow(.)) %>%
      gather(key = "individual", value = "gender", -household) %>%
      mutate(individual = factor(individual),
             gender = factor(gender)) %>%
      filter(gender == "Male") %>%
      group_by(household) %>%
      summarize(males = n()) %>%
      arrange(desc(males)) %>%
      group_by(males) %>%
      summarize(households = n())
    
    # # A tibble: 3 x 2
    # males households
    # <int>      <int>
    # 1          3
    # 4          1
    # 5          1
    

    Or if you want to count the numbers of households with males and females, then it's just an additional grouping.

    wide.df %>%
      mutate(household = 1:nrow(.)) %>%
      gather(key = "individual", value = "gender", -household) %>%
      mutate(individual = factor(individual),
             gender = factor(gender)) %>%
      filter(!is.na(gender)) %>%
      group_by(household, gender) %>%
      summarize(count = n()) %>%
      group_by(gender, count) %>%
      summarize(households = n()) %>%
      arrange(count)
    
    # # A tibble: 6 x 3
    # # Groups:   gender [2]
    # gender count households
    # <fct>  <int>      <int>
    # Female     1          1
    # Male       1          3
    # Female     2          2
    # Female     3          1
    # Male       4          1
    # Male       5          1