Tags: r, loops, apply, processing-efficiency

Translate R for loop into apply function


I have written a for loop in my code:

for (i in 2:nrow(ProductionWellYear2)) {
  if (ProductionWellYear2[i, ncol(ProductionWellYear2)] == 0) {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i - 1, ncol(ProductionWellYear2)] + 1
  } else {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i, ncol(ProductionWellYear2)]
  }
}

However, this is very slow because the data frame has over 800k rows. How can I make it quicker and avoid the for loop?


Solution

  • This should work for you, but without seeing your data I can't verify that the results are what you want. The process is essentially the same as in your original loop. Benchmarking on my example data shows it is faster, though not necessarily "fast". Part of the gain likely comes from dropping the else branch, which only reassigns each value to itself.

    library(microbenchmark)
    # Create fake data
    set.seed(1)
    ProductionWellYear <- data.frame(A = as.integer(rnorm(2500)),
                                     B = as.integer(rnorm(2500)),
                                     C = as.integer(rnorm(2500))
    )
    
    # Copy it to confirm results of both processes are the same
    ProductionWellYear2 <- ProductionWellYear
    
    
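    # Slightly modified original version.
    # Assignments inside the function change a local copy of the data frame,
    # so the copy is returned to make the result available for comparison.
    method1 <- function() {
      cols <- ncol(ProductionWellYear)
      for (i in 2:nrow(ProductionWellYear)) {
        if (ProductionWellYear[i, cols] == 0) {
          ProductionWellYear[i, cols] <- ProductionWellYear[i - 1, cols] + 1
        } else {
          # No-op kept from the original loop
          ProductionWellYear[i, cols] <- ProductionWellYear[i, cols]
        }
      }
      ProductionWellYear
    }
    
    # New version: `<<-` updates the global ProductionWellYear2 in place
    method2 <- function() {
      cols <- ncol(ProductionWellYear2)
      sapply(2:nrow(ProductionWellYear2), function(i) {
        if (ProductionWellYear2[i, cols] == 0) {
          ProductionWellYear2[i, cols] <<- ProductionWellYear2[i - 1, cols] + 1
        }
      })
    }
    
    
    # Comparing the outputs: both methods have to run before the comparison
    invisible(method2())  # updates ProductionWellYear2 via `<<-`
    all(method1() == ProductionWellYear2)
    #[1] TRUE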
    
    result <- microbenchmark(method1(), method2())
    result
    #Unit: milliseconds
    #      expr      min       lq     mean   median       uq       max neval
    #  method1() 151.78802 167.3932 190.14905 176.2855 197.60406 337.9904   100
    #  method2()  45.56065  53.7744  67.55549  59.9299  72.81873 174.1417   100
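
Since the question asks how to avoid the loop altogether, here is a minimal vectorized sketch of the same rule. It assumes the column has no NA values and that the first row is left untouched, as in the original loop. The function names `fill_zeros_loop` and `fill_zeros_vec` are made up for illustration and are not part of the answer above. The idea is that every zero ends up equal to the last "anchor" value (row 1 or the nearest preceding non-zero row) plus its distance from that anchor, which can be computed with `cummax` instead of stepping row by row.

```r
# Reference: the same rule as the original loop, but on a plain vector.
# Row 1 is never changed; a zero becomes the (already updated) previous value + 1.
fill_zeros_loop <- function(x) {
  if (length(x) < 2) return(x)
  for (i in 2:length(x)) {
    if (x[i] == 0) x[i] <- x[i - 1] + 1L
  }
  x
}

# Vectorized sketch (assumes no NA values in x).
fill_zeros_vec <- function(x) {
  if (length(x) < 2) return(x)
  idx <- seq_along(x)
  is_anchor <- x != 0
  is_anchor[1] <- TRUE  # row 1 is kept as-is, even if it is zero
  # Index of the last anchor at or before each position
  anchor <- cummax(ifelse(is_anchor, idx, 0L))
  # Anchor value plus the number of rows since that anchor
  x[anchor] + (idx - anchor)
}

# Check both versions against each other on data like the example above
set.seed(1)
x <- as.integer(rnorm(2500))
stopifnot(identical(fill_zeros_loop(x), fill_zeros_vec(x)))

# Possible use on the last column of a data frame such as ProductionWellYear2:
# cols <- ncol(ProductionWellYear2)
# ProductionWellYear2[[cols]] <- fill_zeros_vec(ProductionWellYear2[[cols]])
```

Pulling the column out as a vector avoids repeated data frame indexing, which is usually the expensive part of loops like this. I have not benchmarked this on 800k rows, so treat any speedup as something to measure on your own data.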