Search code examples
rmultiple-columnstidyrsurveystrsplit

Conditionally Separate String into Columns


It is common in surveys to ask a question and then tell participants to "select all that apply". For example, "Which foods do you enjoy eating (Please select all that apply)?" a) Sushi, b) Pasta, c) Hamburger.

Assuming four (N=4) participants answered this question, the data could look like this.

food.df <- data.frame(id = c(1,2,3,4), food.choice = c("1,2", "", "1,2,3", "3"))

What I am trying to do is conditionally separate these into unique columns using a method that is flexible on the number of individuals and the number of food choice attributes (i.e. Sushi, Pasta, Hamburger, ....). The final data would look something like this.

food.final <- data.frame(id= c(1,2,3,4), sushi = c(1,0,1,0), pasta = c(1,0,1,0), hamburger = c(0,0,1,1))

The more advanced version of this would allow for conditional groupings. You can think of this as grouping by food groups, location, etc. Assuming we were grouping by "selected foods that have protein" this could be coded to reflect total choices. This could look something like this.

food.group <- data.frame(id = c(1,2,3,4), protein = c(1,0,2,1), non.protein = c(1,0,1,0))

I have tried to use tidyr::separate, strsplit, and other column splitting functions but cannot seem to get the desired outcome. Appreciate the help on this and hopefully, the answer helps other users of R who do survey work.


Solution

  • We may use fastDummies

    library(fastDummies)
    library(dplyr)
    dummy_cols(food.df, 'food.choice', split = ",", 
        remove_selected_columns = TRUE) %>%
        setNames(c("id", "sushi", "pasta", "hamburger"))
    

    -output

       id sushi pasta hamburger
    1  1     1     1         0
    2  2     0     0         0
    3  3     1     1         1
    4  4     0     0         1
    

    If the renaming should be automatic, create a named vector and use str_replace

    library(stringr)
    nm1 <- setNames(c("sushi", "pasta", "hamburger"), 1:3)
     dummy_cols(food.df, 'food.choice', split = ",", 
        remove_selected_columns = TRUE) %>% 
       rename_with(~ str_replace_all(str_remove(.x, 'food.choice_'), nm1), -id)
      id sushi pasta hamburger
    1  1     1     1         0
    2  2     0     0         0
    3  3     1     1         1
    4  4     0     0         1
    

    For the second case, we may use str_count

    food.df %>%
       mutate(protein = str_count(food.choice, '[13]'), 
        non.protein = str_count(food.choice, '2'), .keep = 'unused')
      id protein non.protein
    1  1       1           1
    2  2       0           0
    3  3       2           1
    4  4       1           0