Search code examples
rregextidyversetidyrregex-lookarounds

R tidyr: use separate function to separate character column with comma-separated text into multiple columns using RegEx


I have the following dataframe

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three"))

It looks like this

                x
1             one
2        one, two
3      two, three
4 one, two, three

I would like to be able to separate this x column into many different columns, one for every distinct word in the column x. Basically I would like the final result to be something like this

    one  two  three
1    1    0     0
2    1    1     0
3    0    1     1
4    1    1     1

I think that in order to obtain that dataframe, I probably need to be able to use the separate function provided by tidyr and documented here. However, this requires knowledge of regular expressions, and I'm not good with them. Can anyone help me obtain this dataframe?

IMPORTANT: I do not know the number, nor the spelling of the words a priori.

Important Example

It should work also with empty strings. For instance if we have

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three", ""))

then it should also work.


Solution

  • With tidyverse, we can use separate_rows to split up the 'x' column, create a sequence column and use pivot_wider from tidyr

    library(dplyr)
    library(tidyr)
    df %>% 
       filter(!(is.na(x)|x==""))%>% 
       mutate(rn = row_number()) %>% 
       separate_rows(x) %>%
       mutate(i1 = 1) %>% 
       pivot_wider(names_from = x, values_from = i1, , values_fill = list(i1 = 0)) %>%
       select(-rn)
    # A tibble: 4 x 3
    #    one   two three
    #  <dbl> <dbl> <dbl>
    #1     1     0     0
    #2     1     1     0
    #3     0     1     1
    #4     1     1     1
    

    In the above code, the rn column was added to have distinct identifier for each rows after we expand the rows with separate_rows, otherwise, it can result in a list output column in pivot_wider when there are duplicate elements. The 'i1' with value 1 is added to be used in the values_from. Another option is to specify values_fn = length


    Or we can use table after splitting the 'x' column in base R

    table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])