Search code examples
rdata.tabledcast

Apply dcast multiple times for different variables


Suppose I have a data frame with id A and a couple of factor variables.

library(data.table)
library(dplyr)

mydf <- data.table(
  A =  as.factor(c("A", "B", "C", "D","E")),
  V1 = as.factor(c("x", "x", "y", "x", "y")),
  V2 = as.factor(c("u", "v", "w", "v", "u"))
)

In a real example, there can be more columns (say 1000) and the dataset has a couple of million rows, so I really care about performance.

I want to convert it into the following data.frame:

  A V1_x V1_y V2_u V2_v V2_w
1 A    1    0    1    0    0
2 B    1    0    0    1    0
3 C    0    1    0    0    1
4 D    1    0    0    1    0
5 E    0    1    1    0    0

What would be the efficient method to implement this conversion?

The way I do it right now is to apply dcast multiple times for each column separately. And rename columns accordingly.

f <- function(x) as.integer(length(x) > 0)
mydf2 =
  mydf %>%
  dcast(... ~ V1, fun.aggregate = f, value.var = "V1") %>%
  dcast(... ~ V2, fun.aggregate = f, value.var = "V2") 
  
colnames(mydf2) = c("A", "V1_x", "V1_y", "V2_u",  "V2_v",  "V2_w")

That does not seem efficient for many variables if we run it in a loop. In addition, renaming the variables at each step is not very robust if the number of factors varies from variable to variable.

Loop version for the same solution:


mydf2 = mydf
for(tmp_var in  c("V1","V2")){
  ncol_before = ncol(mydf2)
  mydf2 =
    mydf2 %>%
    dcast( as.formula( sprintf("... ~ %s", tmp_var)  ),
          fun.aggregate = f, value.var = tmp_var)
  ncol_after = ncol(mydf2)
  # assign correct names to created vars
  colnames(mydf2)[ncol_before:ncol_after ] = paste0(tmp_var, "_", colnames(mydf2)[ncol_before:ncol_after ])
}
mydf2


Solution

  • Here is an option with cSplit_e

    library(splitstackshape)
    cSplit_e(mydf, 'V1', type = 'character', fill = '0') %>%
          cSplit_e('V2', type = 'character', fill = '0')
    #   A V1 V2 V1_x V1_y V2_u V2_v V2_w
    #1: A  x  u    1    0    1    0    0
    #2: B  x  v    1    0    0    1    0
    #3: C  y  w    0    1    0    0    1
    #4: D  x  v    1    0    0    1    0
    #5: E  y  u    0    1    1    0    0
    

    Or with table from base R

     do.call(cbind, lapply(2:3, function(i) table(mydf$A, mydf[[i]])))
    

    Or the same approach in data.table syntax

    nm1 <- names(mydf)[-1]
    out <- mydf[,  lapply(.SD, function(x) 
             as.data.frame.matrix(table(A, x))), .SDcols = nm1]
    mydf[, names(out) := out][]
    #   A V1 V2 V1.x V1.y V2.u V2.v V2.w
    #1: A  x  u    1    0    1    0    0
    #2: B  x  v    1    0    0    1    0
    #3: C  y  w    0    1    0    0    1
    #4: D  x  v    1    0    0    1    0
    #5: E  y  u    0    1    1    0    0