Search code examples
roptimizationwhile-loopnestedmulticore

How to optimize the following code with nested while-loop? Multicore an option?


I am having a challenge with a piece of code that takes very long to execute and I am wondering what are the key tricks to optimize the execution time of this code. I have to admit that the input data.frame is significant (140,000 rows) and that the output data.frame is approximately 220,000 rows.

A sample of the input data.frame:

head(extremes)
X_BusinessIDDescription     min         max         month
ID105                       2007-12-01  2008-06-01  2007-12-01
ID206                       2007-12-01  2009-07-01  2007-12-01
ID204                       2007-12-01  2008-02-01  2007-12-01
ID785                       2008-07-01  2010-08-01  2008-07-01
ID125                       2007-11-01  2008-07-01  2007-11-01
ID107                       2007-11-01  2011-06-01  2007-11-01

The data.frame that will be extended with the loop. The data.frame is initiated to get the structure in place.

output <- extremes[1,]
output
X_BusinessIDDescription     min         max         month
ID105                       2007-12-01  2008-06-01  2007-12-01

Other values

IDcounter <- 1
IDmax <- nrow(extremes)
linecounter <- 1

The while-loop I would like to optimize:

while (IDcounter <= IDmax){
    start <- extremes$min[IDcounter]
    end <- extremes$max[IDcounter] # add three months
    while(start <= end){
        output[linecounter,] <- extremes[IDcounter,]
        output$month[linecounter] <- start
        linecounter <- linecounter+1
        start <- seq(start, by ="month", length=2)[2]
    }
    IDcounter <- IDcounter + 1
}

For a small number of rows this code executes pretty quickly, but it seems like it is slowing down as the output extends.

The output looks something like this:

head(output)
X_BusinessIDDescription     min         max         month
ID105                       2007-12-01  2008-06-01  2007-12-01
ID105                       2007-12-01  2008-06-01  2008-01-01
ID105                       2007-12-01  2008-06-01  2008-02-01
ID105                       2007-12-01  2008-06-01  2008-03-01
ID105                       2007-12-01  2008-06-01  2008-04-01
ID105                       2007-12-01  2008-06-01  2008-05-01

For every month in the interval between min and max in the extreme file is an row created.

I also would be interested to learn how I can can that this code can take ready of the multiple cores of computing resources available. OK, I admit this is not really an optimization but it will reduce the execution time, which is important as well.

Jochem


Solution

  • As @CarlWitthoft already mentioned you have to rethink your data structure because of many duplicated data.

    Here you find a simple vectorized approach:

      ## create all possible ranges of months
      ranges <- mapply(function(mi, ma) {seq(from=mi, to=ma, by="month")}, mi=extremes$min, ma=extremes$max)
    
      ## how many months per ID?
      n <- unlist(lapply(ranges, length))
    
      ## create new data.frame
      output <- data.frame(X_BusinessIDDescription=rep(extremes$X_BusinessIDDescription, n),
                          min=rep(extremes$min, n),
                          max=rep(extremes$max, n),
                          month=as.Date(unlist(ranges), origin="1970-01-01"), stringsAsFactors=FALSE)
    

    Comparison to your approach:

    extremes <- data.frame(X_BusinessIDDescription=c("ID105", "ID206", "ID204", "ID785", "ID125", "ID107"),
                          min=as.Date(c("2007-12-01", "2007-12-01", "2007-12-01", "2008-07-01", "2007-11-01", "2007-11-01")),
                          max=as.Date(c("2008-06-01", "2009-07-01", "2008-02-01", "2010-08-01", "2008-07-01", "2011-06-01")),
                          month=as.Date(c("2007-12-01", "2007-12-01", "2007-12-01", "2008-07-01", "2007-11-01", "2007-11-01")),
                          stringsAsFactors=FALSE)
    
    approachWhile <- function(extremes) {
      output <- data.frame(X_BusinessIDDescription=NA, min=as.Date("1970-01-01"), max=as.Date("1970-01-01"), month=as.Date("1970-01-01"), stringsAsFactors=FALSE)
      IDcounter <- 1
      IDmax <- nrow(extremes)
      linecounter <- 1
      while (IDcounter <= IDmax){
        start <- extremes$min[IDcounter]
        end <- extremes$max[IDcounter] # add three months
        while(start <= end){
            output[linecounter,] <- extremes[IDcounter,]
            output$month[linecounter] <- start
            linecounter <- linecounter+1
            start <- seq(start, by ="month", length=2)[2]
        }
        IDcounter <- IDcounter + 1
      }
      return(output)
    }
    
    approachMapply <- function(extremes) {                       
      ranges <- mapply(function(mi, ma) {seq(from=mi, to=ma, by="month")}, mi=extremes$min, ma=extremes$max)
    
      n <- unlist(lapply(ranges, length))
    
      output <- data.frame(X_BusinessIDDescription=rep(extremes$X_BusinessIDDescription, n),
                          min=rep(extremes$min, n),
                          max=rep(extremes$max, n),
                          month=as.Date(unlist(ranges), origin="1970-01-01"), stringsAsFactors=FALSE)
      return(output)
    }
    
    identical(approachWhile(extremes), approachMapply(extremes)) ## TRUE
    
    library("rbenchmark")
    
    benchmark(approachWhile(extremes), approachMapply(extremes), order="relative")
    #                      test replications elapsed relative user.self sys.self
    #2 approachMapply(extremes)          100   0.176     1.00     0.172    0.000
    #1  approachWhile(extremes)          100   6.102    34.67     6.077    0.008