Improving performance of split() function in R?


I have a data frame in a very simple form:

    X Y
    ---
    A 1
    A 2
    B 3
    C 1
    C 3

My end result should be a list like this:

    $`A`
    [1] 1 2

    $`B`
    [1] 3

    $`C`
    [1] 1 3

For this operation I am using the split() function in R:

    k <- split(df$Y, df$X)

This works fine. However, when I apply it to a data frame with 22 million rows, 10 million groups in X, and 387,000 values for Y, it becomes very time-consuming. I tried Revolution R Open (RRO) 8.0 for its MKL support, but still only one core is used. The machine has 64 GB of RAM, so memory shouldn't be an issue.

Any ideas for a smarter way to compute this?


Solution

  • Try data.table:

     library(data.table)
     DT <- as.data.table(df)
     DT1 <- DT[, list(Y=list(Y)), by=X]
     DT1$Y
     #[[1]]
     #[1] 1 2
    
     #[[2]]
     #[1] 3
    
     #[[3]]
     #[1] 1 3
    

    Or using dplyr

     library(dplyr)
     df1 <- df %>%
         group_by(X) %>%
         do(Y = c(.$Y))
    
     df1$Y
     #[[1]]
     #[1] 1 2
    
     #[[2]]
     #[1] 3
    
     #[[3]]
     #[1] 1 3
    

    Data

     df <- structure(list(X = c("A", "A", "B", "C", "C"), Y = c(1L, 2L, 
     3L, 1L, 3L)), .Names = c("X", "Y"), class = "data.frame", row.names = c(NA, 
     -5L))
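
Both results above are unnamed lists, while the question asks for a list named by group, like the output of split(). A minimal sketch of one way to attach the names to the data.table result, assuming the data.table package is installed; `res` is a name introduced here for illustration and is not part of the original answer:

```r
library(data.table)

# Sample data from the question (same values as the `df` defined above).
df <- data.frame(X = c("A", "A", "B", "C", "C"),
                 Y = c(1L, 2L, 3L, 1L, 3L),
                 stringsAsFactors = FALSE)

DT <- as.data.table(df)

# One row per group; Y becomes a list column holding that group's values.
DT1 <- DT[, list(Y = list(Y)), by = X]

# Use the group keys as element names so the list can be indexed by group.
res <- setNames(DT1$Y, DT1$X)

res[["C"]]
```

Note that `by=` keeps groups in order of first appearance, whereas split() orders them by factor level, so the element order may differ from split() when X is not sorted.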