
Creating an Ethnicity Variable with Multiple Column Names as Values


I have a survey dataset that includes self-reported ethnicity. Participants could select as many ethnicities as they wanted. The data structure looks like this:

    Hispanic English Indian
    1        NA      NA
    NA       1       NA
    NA       NA      1
    NA       1       1
    1        1       1

I want to create a new categorical ethnicity variable in which the column names take the place of the 1s above. If someone selected more than one ethnicity, the variable should include all of them, joined by underscores, like this:

    Hispanic English Indian Ethnicity
    1        NA      NA     Hispanic
    NA       1       NA     English
    NA       NA      1      Indian
    NA       1       1      English_Indian
    1        1       1      Hispanic_English_Indian


Solution

  • We can use apply to loop over the rows (MARGIN = 1) and, for each row, paste together the names of the values that are not NA, separated by underscores:

    # For each row, keep the names of the non-NA columns and join them with "_"
    df1$Ethnicity <- apply(df1, 1, function(x)
      paste(names(x)[!is.na(x)], collapse = "_"))
    

    Output:

    df1
      Hispanic English Indian               Ethnicity
    1        1      NA     NA                Hispanic
    2       NA       1     NA                 English
    3       NA      NA      1                  Indian
    4       NA       1      1          English_Indian
    5        1       1      1 Hispanic_English_Indian
    

    Data:

    df1 <- structure(list(
      Hispanic = c(1L, NA, NA, NA, 1L),
      English  = c(NA, 1L, NA, 1L, 1L),
      Indian   = c(NA, NA, 1L, 1L, 1L)
    ), class = "data.frame", row.names = c(NA, -5L))
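
The apply call above looks at every column of df1, so it would also pick up any other non-NA columns (an ID column, or the Ethnicity column itself if the line is run twice), and a participant who selected nothing would get an empty string. The following is a minimal sketch of one way to guard against both, assuming a hypothetical data frame df2 with an extra id column and a hypothetical vector eth_cols naming the ethnicity columns; neither name comes from the original question.

```r
# Hypothetical example data: the same three ethnicity columns, plus an
# id column and one participant (row 4) who selected nothing.
df2 <- data.frame(
  id       = 1:4,
  Hispanic = c(1L, NA, NA, NA),
  English  = c(NA, 1L, 1L, NA),
  Indian   = c(NA, NA, 1L, NA)
)

# Assumption: these are the only columns that hold ethnicity indicators.
eth_cols <- c("Hispanic", "English", "Indian")

# Logical matrix: TRUE where a participant selected that ethnicity.
selected <- !is.na(df2[eth_cols])

# For each row, join the names of the selected columns with "_".
df2$Ethnicity <- apply(selected, 1, function(x) paste(eth_cols[x], collapse = "_"))

# Rows with no selection produce "", so recode those as missing.
df2$Ethnicity[df2$Ethnicity == ""] <- NA_character_
```

Restricting the call to eth_cols keeps the result stable when the data frame gains other columns, and recoding the empty string to NA makes "no ethnicity selected" easy to distinguish from a real category in later tabulations.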