Search code examples
rmatrixdplyrtapply

To speed up the tapply function in R, or another function to convert data frame into a matrix


I need to convert a huge dataset into a matrix. The data structure likes the below data "x". When I use the function tapply to do it (see below), it cannot work due to memory limit for the huge dataset. I am wondering if there is another way to do it. I tried for the R package dplyr, but didn't figure out how to do it. Any suggestion for that? Thanks so much.

x <- data.frame(c1=c("A","B","C","A","B","C"), c2=1:6, c3=c("sp1","sp2","sp2","sp2","sp3","sp4"))
y <- tapply(x$c2, list(x$c1, x$c3), sum)

Solution

  • You can use pivot_wider :

    tidyr::pivot_wider(x, names_from = c3, values_from = c2, 
                       values_fn = sum, values_fill = 0)
    
    #  c1      sp1   sp2   sp3   sp4
    #  <chr> <int> <int> <int> <int>
    #1 A         1     4     0     0
    #2 B         0     2     5     0
    #3 C         0     3     0     6
    

    data.table is usually faster in which case you can use -

    library(data.table)
    dcast(setDT(x), c1~c3, value.var = 'c2', fun.aggregate = sum)
    

    Output can also be represented in long format.

    library(dplyr)
    
    x %>% 
      group_by(c1, c3) %>%
      summarise(c2 = sum(c2)) %>%
      ungroup
    

    and with data.table -

    setDT(x)[, sum(c2), .(c1, c3)]