r, dataframe, dplyr

Create dummy variable across multiple columns by group


I have data in a household roster, as in the data frame below:

hhroster <- data.frame(HHID = c(1, 1,   1,  2,  2,  3,  3,  3,  3,  4,  4,  4,  5,  5,  6),                     
                    INDID = c(1,    2,  3,  1,  2,  1,  2,  3,  4,  1,  2,  3,  1,  2,  1),
                    response_1 = c("yes",   "no",   "yes",  "yes",  "no",   "no",   "no",   "no",   "no",   "yes",  "yes",  "no",   "yes",  "yes",  "no"),
                    response_2 = c("no",    "no",   "yes",  "no",   "no",   "no",   "yes",  "no",   "no",   "no",   "no",   "no",   "yes",  "yes",  "no"))

I would like to create a household-level dummy variable for each response column, where the value 1 indicates that at least one individual in the household answered "yes". The desired output is:

hh <- data.frame(HHID = c(1,    2,  3,  4,  5,  6),
                       HH_response_1 = c(1, 1,  0,  1,  1,  0),
                       HH_response_2 = c(1, 0,  1,  0,  1,  0))

Addendum: I have realized that the dataset also contains values such as DK and RF, as well as missing values. If all of a household's values are among these, the aggregate value should be NA rather than 0.


Solution

  • Here is a solution.
    Use across() to select all columns of interest, then check whether each household has any "yes" value by testing whether the sum of the logical vector .x == "yes" is greater than zero.
    You can keep the results as logical; R coerces FALSE/TRUE to 0/1 when necessary, for example in arithmetic. Note that the .by argument of summarise() requires dplyr 1.1.0 or later.

    hhroster <- data.frame(HHID = c(1, 1,   1,  2,  2,  3,  3,  3,  3,  4,  4,  4,  5,  5,  6),                     
                           INDID = c(1,    2,  3,  1,  2,  1,  2,  3,  4,  1,  2,  3,  1,  2,  1),
                           response_1 = c("yes",   "no",   "yes",  "yes",  "no",   "no",   "no",   "no",   "no",   "yes",  "yes",  "no",   "yes",  "yes",  "no"),
                           response_2 = c("no",    "no",   "yes",  "no",   "no",   "no",   "yes",  "no",   "no",   "no",   "no",   "no",   "yes",  "yes",  "no"))
    
    suppressPackageStartupMessages(
      library(dplyr)
    )
    
    hhroster %>%
      summarise(
        across(starts_with("response"), ~ sum(.x == "yes") > 0L),
        .by = HHID
      )
    #>   HHID response_1 response_2
    #> 1    1       TRUE       TRUE
    #> 2    2       TRUE      FALSE
    #> 3    3      FALSE       TRUE
    #> 4    4       TRUE      FALSE
    #> 5    5       TRUE       TRUE
    #> 6    6      FALSE      FALSE
    

    Created on 2024-02-10 with reprex v2.0.2
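
The addendum in the question (DK, RF and missing values) is not covered by the code above: sum(.x == "yes") returns NA as soon as a single value in the group is missing, and a household whose answers are all DK or RF would get FALSE rather than NA. The following is a minimal sketch of one way to handle this, not a definitive implementation. The data frame hhroster2 and the helper hh_dummy() are hypothetical names introduced for illustration, and the sketch assumes that only "yes" and "no" count as valid answers. It also returns integer 0/1 and uses .names to produce the HH_ prefix from the desired output.

```r
suppressPackageStartupMessages(
  library(dplyr)
)

# Hypothetical roster (not the question's data) containing DK, RF and NA values
hhroster2 <- data.frame(
  HHID       = c(1, 1, 2, 2, 3, 3, 4, 4),
  INDID      = c(1, 2, 1, 2, 1, 2, 1, 2),
  response_1 = c("yes", "DK", "no", "RF", "DK", NA, "no", "no"),
  response_2 = c("no", "no", "RF", "DK", "yes", NA, NA, "no")
)

# Hypothetical helper, assuming only "yes" and "no" are valid answers:
#   NA if the household has no valid answer at all,
#   1  if at least one valid answer is "yes",
#   0  otherwise.
hh_dummy <- function(x) {
  valid <- x %in% c("yes", "no")  # %in% yields FALSE for NA, never NA
  if (!any(valid)) {
    NA_integer_
  } else {
    as.integer(any(x[valid] == "yes"))
  }
}

hh <- hhroster2 %>%
  summarise(
    across(starts_with("response"), hh_dummy, .names = "HH_{.col}"),
    .by = HHID  # requires dplyr >= 1.1.0
  )

hh
```

With this sample data, household 2 gets NA for HH_response_2 (only RF and DK) and household 3 gets NA for HH_response_1 (DK and a missing value), while household 4 gets 0 for HH_response_2 because it still has one valid "no". Whether a household that mixes "no" with DK or RF should count as 0 or NA is a design choice; the sketch follows the addendum and returns NA only when every value is invalid.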