I have a dataframe like this:
df <- data.frame(V1=c("a,b,c,d,e,f","a,b,c","e,f","b,d","a,e"))
I want to generate a dummy variable for every category that appears in V1, something like this:
df$a <- c(1,1,0,0,1)
df$b <- c(1,1,0,1,0)
df$c <- c(1,1,0,0,0)
df$d <- c(1,0,0,1,0)
df$e <- c(1,0,1,0,1)
df$f <- c(1,0,1,0,0)
> df
           V1 a b c d e f
1 a,b,c,d,e,f 1 1 1 1 1 1
2       a,b,c 1 1 1 0 0 0
3         e,f 0 0 0 0 1 1
4         b,d 0 1 0 1 0 0
5         a,e 1 0 0 0 1 0
How can I do this efficiently? I have a large data frame, and V1 has many categories.
Here is a solution which uses strsplit() to split up the character strings and dcast() to reshape from long to wide format:
library(data.table)
setDT(df)[, rn := .I][
, strsplit(as.character(V1), ","), by = rn][
, dcast(.SD, rn ~ V1, length)]
   rn a b c d e f
1:  1 1 1 1 1 1 1
2:  2 1 1 1 0 0 0
3:  3 0 0 0 0 1 1
4:  4 0 1 0 1 0 0
5:  5 1 0 0 0 1 0
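To make the two steps easier to follow, here is the same idea written out with named intermediate results. This is only a sketch: the names dt, long, wide and category are made up for illustration, and fixed = TRUE is an addition that assumes the separator is always a literal comma.

```r
library(data.table)

df <- data.frame(V1 = c("a,b,c,d,e,f", "a,b,c", "e,f", "b,d", "a,e"))

# Work on a copy and add a row number to identify each original row
dt <- as.data.table(df)
dt[, rn := .I]

# Step 1: long format, one row per (row number, category) pair
long <- dt[, .(category = unlist(strsplit(as.character(V1), ",", fixed = TRUE))),
           by = rn]

# Step 2: wide format, counting how often each category occurs per row;
# combinations that never occur are filled with 0
wide <- dcast(long, rn ~ category, fun.aggregate = length, value.var = "category")
```

Printing long shows one line per category occurrence, which is what dcast() then counts. If a category could appear twice in the same string, the count would be 2 rather than 1, so the columns are strictly counts, not 0/1 flags.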
If V1 is to be included, it can be joined afterwards:
library(data.table) # version 1.11.4 used
setDT(df)[, rn := .I][
, strsplit(as.character(V1), ","), by = rn][
, dcast(.SD, rn ~ V1, length)][
df, on = "rn"][
, setcolorder(.SD, "V1")]
            V1 rn a b c d e f
1: a,b,c,d,e,f  1 1 1 1 1 1 1
2:       a,b,c  2 1 1 1 0 0 0
3:         e,f  3 0 0 0 0 1 1
4:         b,d  4 0 1 0 1 0 0
5:         a,e  5 1 0 0 0 1 0
setcolorder() is used to move the V1 column to the front.
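Since setcolorder() works by reference, it can also be called on a stored result instead of on .SD inside the chain. The following is a sketch of that variant, not a replacement for the code above. The names dt, long, wide, category and result are illustrative, and dropping the helper column rn at the end is an optional extra step to match the layout asked for in the question.

```r
library(data.table)

df <- data.frame(V1 = c("a,b,c,d,e,f", "a,b,c", "e,f", "b,d", "a,e"))

# Copy of the data with a row number as join key
dt <- as.data.table(df)[, rn := .I]

# Split into long format, then count categories per row in wide format
long <- dt[, .(category = unlist(strsplit(as.character(V1), ",", fixed = TRUE))),
           by = rn]
wide <- dcast(long, rn ~ category, fun.aggregate = length, value.var = "category")

# Join the original strings back on; V1 ends up as the last column
result <- wide[dt, on = "rn"]

# Move V1 to the front by reference, then drop the helper column
setcolorder(result, "V1")
result[, rn := NULL]
```

After this, result has the columns V1, a, b, c, d, e and f, in that order.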