I have the following dataframe
df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three"))
It looks like this
x
1 one
2 one, two
3 two, three
4 one, two, three
I would like to be able to separate this x
column into many different columns, one for every distinct word in the column x
. Basically I would like the final result to be something like this
one two three
1 1 0 0
2 1 1 0
3 0 1 1
4 1 1 1
I think that in order to obtain that dataframe, I probably need to be able to use the separate
function provided by tidyr
and documented here. However, this requires knowledge of regular expressions, and I'm not good with them. Can anyone help me obtain this dataframe?
IMPORTANT: I do not know the number, nor the spelling of the words a priori.
It should work also with empty strings. For instance if we have
df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three", ""))
then it should also work.
With tidyverse
, we can use separate_rows
to split up the 'x' column, create a sequence column and use pivot_wider
from tidyr
library(dplyr)
library(tidyr)
df %>%
filter(!(is.na(x)|x==""))%>%
mutate(rn = row_number()) %>%
separate_rows(x) %>%
mutate(i1 = 1) %>%
pivot_wider(names_from = x, values_from = i1, , values_fill = list(i1 = 0)) %>%
select(-rn)
# A tibble: 4 x 3
# one two three
# <dbl> <dbl> <dbl>
#1 1 0 0
#2 1 1 0
#3 0 1 1
#4 1 1 1
In the above code, the rn
column was added to have distinct identifier for each rows after we expand the rows with separate_rows
, otherwise, it can result in a list
output column in pivot_wider
when there are duplicate elements. The 'i1' with value 1 is added to be used in the values_from
. Another option is to specify values_fn = length
Or we can use table
after splitting the 'x' column in base R
table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])