
Generate all possible dummy variables from the values of a variable in R


I have a dataframe like this:

df <- data.frame(V1=c("a,b,c,d,e,f","a,b,c","e,f","b,d","a,e"))

I want to generate a dummy variable for every category that appears in V1, something like this:

df$a <- c(1,1,0,0,1)
df$b <- c(1,1,0,1,0)
df$c <- c(1,1,0,0,0)
df$d <- c(1,0,0,1,0)
df$e <- c(1,0,1,0,1)
df$f <- c(1,0,1,0,0)

> df
           V1 a b c d e f
1 a,b,c,d,e,f 1 1 1 1 1 1
2       a,b,c 1 1 1 0 0 0
3         e,f 0 0 0 0 1 1
4         b,d 0 1 0 1 0 0
5         a,e 1 0 0 0 1 0

How can I do this efficiently? I have a big dataframe and V1 has a lot of categories.


Solution

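  • Here is a solution which uses strsplit() to split up the character strings and dcast() to reshape from long to wide format. length is used as the aggregation function, so each cell counts how often a category occurs in a row. Note that setDT() converts df to a data.table by reference and the rn column is added to it: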

    library(data.table)
    setDT(df)[, rn := .I][
      , strsplit(as.character(V1), ","), by = rn][
        , dcast(.SD, rn ~ V1, length)]
    
       rn a b c d e f
    1:  1 1 1 1 1 1 1
    2:  2 1 1 1 0 0 0
    3:  3 0 0 0 0 1 1
    4:  4 0 1 0 1 0 0
    5:  5 1 0 0 0 1 0
    

    If V1 is to be included, it can be joined afterwards:

    library(data.table) # version 1.11.4 used
    setDT(df)[, rn := .I][
      , strsplit(as.character(V1), ","), by = rn][
        , dcast(.SD, rn ~ V1, length)][
          df, on = "rn"][
            , setcolorder(.SD, "V1")]
    
                V1 rn a b c d e f
    1: a,b,c,d,e,f  1 1 1 1 1 1 1
    2:       a,b,c  2 1 1 1 0 0 0
    3:         e,f  3 0 0 0 0 1 1
    4:         b,d  4 0 1 0 1 0 0
    5:         a,e  5 1 0 0 0 1 0
    

    setcolorder() is used to move the V1 column to the front.
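
The chained call above can also be written step by step, which makes the intermediate long format visible. This is only a sketch of the same idea under some assumptions: the names dt, long, wide, result and category are illustrative and do not appear in the original answer, as.data.table() is used instead of setDT() so that df itself is left unchanged, and merge() is swapped in for the join plus setcolorder().

```r
library(data.table)

# Same sample data as in the question
df <- data.frame(V1 = c("a,b,c,d,e,f", "a,b,c", "e,f", "b,d", "a,e"))

# as.data.table() makes a copy, so df itself is not modified
dt <- as.data.table(df)
dt[, rn := .I]  # row id, used to track which source row each category came from

# Step 1: long format, one row per (rn, category) pair
# ("category" is an illustrative column name, not from the original answer)
long <- dt[, .(category = unlist(strsplit(as.character(V1), ",", fixed = TRUE))),
           by = rn]

# Step 2: wide format, counting how often each category occurs per row
wide <- dcast(long, rn ~ category, fun.aggregate = length, value.var = "category")

# Step 3: attach the original V1 column again
result <- merge(dt, wide, by = "rn")
print(result)
```

Because length counts occurrences, a row such as "a,a" would get a 2 rather than a 1. If repeated categories within a row are possible and strict 0/1 dummies are needed, one option is to call unique(long) before dcast().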