r dplyr data-cleaning

Mutating a "rich" character vector into multiple variables denoting the presence of its elements


I'm currently cleaning some survey data in which some variables hold multiple responses each. For instance, respondents endorse all elements that apply, and these are all stored in one variable, e.g., "Dogs, Cats, Rhinos". A reproducible example of one such variable is given below:

library(dplyr); library(magrittr)
set.seed(42)

foo <- data.frame(x = c(sample(LETTERS[1:5],
                               size = runif(1, min = 0, max = 5),
                               replace = FALSE) %>% paste0(collapse = ", "),
                        sample(LETTERS[1:5],
                               size = runif(1, min = 0, max = 5),
                               replace = FALSE) %>% paste0(collapse = ", ")))

What I'm looking to accomplish is to decompose the elements of a variable and create new variables denoting the presence (1) or absence (0) of each element. In this case, the separator between elements is a comma. An example of the intended output is given below.

fooWant <- data.frame("A" = c(1, 0), "B" = c(1, 1), "D" = c(1, 0), "E" = c(1, 1))

So far my progress hasn't been great: I've only managed to parse the elements into a list of character vectors (code below), and I'm hoping that someone can take me the rest of the way there. Thanks a ton :)

strsplit(foo$x %>% as.character, split = "[,]\\s?") %>% sapply(X = ., sort)
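
For illustration, here is a minimal base R sketch of how such a split list could be carried through to 0/1 indicator columns. The input vector `x` is a hard-coded assumption (the real `foo` depends on the random seed and R version), and the names `parts`, `lvls`, `ind` and `out` are hypothetical.

```r
# Hypothetical hard-coded input standing in for foo$x
x <- c("A, E, D, B", "B, E")

# One character vector of elements per respondent
parts <- strsplit(x, split = ",\\s*")

# All distinct elements, sorted; these become the column names
lvls <- sort(unique(unlist(parts)))

# For each respondent, flag each element as present (1) or absent (0),
# then stack the per-respondent vectors as rows of a matrix
ind <- do.call(rbind, lapply(parts, function(p) as.integer(lvls %in% p)))
colnames(ind) <- lvls

out <- as.data.frame(ind)
print(out)
```

Binding the rows with `do.call(rbind, ...)` rather than `sapply` keeps the result a respondents-by-elements matrix even when only a single distinct element occurs.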

Solution

  • A tidyverse solution using tidyr::separate_rows and tidyr::spread

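    library(tidyr)   # separate_rows(), spread()
    library(tibble)  # rowid_to_column()

    foo %>%
        rowid_to_column("row") %>%   # remember which respondent each element came from
        separate_rows(x) %>%         # one row per element
        mutate(n = 1) %>%            # presence marker
        spread(x, n, fill = 0) %>%   # one column per element, 0 where absent
        select(-row)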
    #  A B D E
    #1 1 1 1 1
    #2 0 1 0 1
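
In current tidyr releases `spread()` is superseded by `pivot_wider()`, so the same idea can be sketched with the newer function. This is a hedged variant rather than the answer's own code: it assumes tidyr 1.1.0 or later (for a scalar `values_fill` and for `names_sort`), and it uses a hypothetical hard-coded data frame `foo2`, chosen to be consistent with the output shown above, instead of the seeded `foo`.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for foo with fixed contents
foo2 <- data.frame(x = c("A, E, D, B", "B, E"), stringsAsFactors = FALSE)

res <- foo2 %>%
    rowid_to_column("row") %>%               # keep track of the respondent
    separate_rows(x, sep = ",\\s*") %>%      # one row per element
    mutate(n = 1) %>%                        # presence marker
    pivot_wider(names_from = x,
                values_from = n,
                values_fill = 0,             # 0 where an element is absent
                names_sort = TRUE) %>%       # alphabetical columns, like spread()
    select(-row)

print(as.data.frame(res))
```

With this input the result should have the same 0/1 pattern as the `spread()` output above; `names_sort = TRUE` is only needed to get the columns in alphabetical order, since `pivot_wider()` otherwise keeps them in order of first appearance.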