Search code examples
rperformancefor-loopmultiple-conditions

How to avoid for-loops with multiple criteria in function which()


I have a 25 years data set that looks similar to the following:

        date name        value tag
1 2014-12-01    f -0.338578654  12
2 2014-12-01    a  0.323379254   4
3 2014-12-01    f  0.004163806   9
4 2014-12-01    f  1.365219477   2
5 2014-12-01    l -1.225602543   7
6 2014-12-01    d -0.308544089   9

This is how to replicate it:

set.seed(9)
date <- rep(seq(as.Date("1990-01-01"), as.Date("2015-01-1"), by="months"), each=50)
N <- length(date)
name <- sample(letters, N, replace=T)
value <- rnorm(N)
tag <- sample(c(1:50), N, replace=T)
mydata <- data.frame(date, name, value, tag)
head(mydata)

I would like to create a new matrix that stores values that satisfy multiple criteria. For instance, the sum of values that have a name j and a tag i. I use two for-loops and the which() function to filter out the correct values. Like this:

S <- matrix(data=NA, nrow=length(unique(mydata$tag)), ncol=length(unique(mydata$name)))
for(i in 1:nrow(S)){
  for (j in 1:ncol(S)){
    foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
    S[i,j] <- sum(mydata$value[foo])
  }
}

This is ok for small data sets, but too slow for larger ones. Is it possible to avoid the for-loops or somehow speed up the process?


Solution

  • You can use dcast from package reshape2, with a custom function to sum your values:

    library(reshape2)
    dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
    

    Or simply xtabs, base R:

    xtabs(value~name+tag, mydata)
    

    Some benchmark:

    funcPer = function(){
        S <- matrix(data=NA, nrow=length(unique(mydata$tag)), ncol=length(unique(mydata$name)))
        for(i in 1:nrow(S)){
          for (j in 1:ncol(S)){
            foo <- which(mydata$tag == unique(mydata$tag)[i] & mydata$name == unique(mydata$name)[j])
            S[i,j] <- sum(mydata$value[foo])
          }
        }
    }
    
    colonel1 = function() dcast(mydata, name~tag, value.var='value', fun.aggregate=sum)
    
    colonel2 = function() xtabs(value~name+tag, mydata)
    
    #> system.time(colonel1())
    #  user  system elapsed 
    #   0.01    0.00    0.01 
    #> system.time(colonel2())
    #   user  system elapsed 
    #   0.05    0.00    0.05 
    #> system.time(funcPer())
    #   user  system elapsed 
    #   4.67    0.00    4.82