r, performance, dataframe, for-loop, bioconductor

How to improve the processing time of my R code


I need to manipulate a dataset, but my script (shown below) runs really slowly. The dataset has dimensions 58347 x 41350. I first ran the R script below on a much smaller dataset (58347 x 5), and it took an hour to process. I imagine it will take much longer to process the actual dataset. Do you know of any way to make it run faster?

Please see my code below:

library("LoomExperiment")
dataset <- import("WongAdultRetina homo_sapiens 2019-11-08 16.13.loom")
m <- assay(dataset)
colsums <- colSums(m)
result <- data.frame()
for (i in seq_len(nrow(m))) {
  # Print progress every 500 rows
  if (i %% 500 == 0) {
    print(paste("i =", i))
  }
  for (j in seq_len(ncol(m))) {
    # Scale each value to 2000 per column total; zero-sum columns stay 0
    if (colsums[j] == 0) {
      result[i, j] <- 0
    } else {
      result[i, j] <- (m[i, j] * 2000) / colsums[j]
    }
  }
}
save(result, file = "resultlocal.rda")

Thank you so much.


Solution

  • It's hard to say what to do without understanding exactly what you're trying to achieve here. But I'll try.

    First, you can replace data.frame with data.table. In my experience, data.tables are much faster to work with.

    Second, you can create the result data.frame at its final size up front. It looks like it will always be nrow(m) by ncol(m), so: result = as.data.frame(matrix(nrow = nrow(m), ncol = ncol(m))). Of course, you can use a data.table here too. Specifying the size allocates enough memory for the whole object at once. This way, R doesn't have to grow the object (copy the contents of the original frame into an object that is one unit bigger and then delete the original) every time another element is added.
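
A minimal sketch of the preallocation idea, assuming a small toy matrix in place of the real loom data (the 4 x 5 values and the zeroed third column are made up for illustration). The loop is the same as in the question; the only change is that result is created at its final size first. It is filled with 0 rather than NA here, so columns whose sum is zero need no assignment at all.

```r
# Toy stand-in for assay(dataset): a 4 x 5 matrix with one all-zero column.
# (Assumption: illustrative values only, not the real data.)
m <- matrix(1:20, nrow = 4, ncol = 5)
m[, 3] <- 0
colsums <- colSums(m)

# Preallocate the result at its final size instead of growing an
# empty data.frame() one cell at a time.
result <- as.data.frame(matrix(0, nrow = nrow(m), ncol = ncol(m)))

for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    # Zero-sum columns keep their preallocated 0.
    if (colsums[j] != 0) {
      result[i, j] <- (m[i, j] * 2000) / colsums[j]
    }
  }
}
```

After the loop, every column of result with a non-zero column sum in m adds up to 2000, which is a quick sanity check on the scaling.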