
R loops - is there a more efficient way?


I have a data frame where each row has a unique ID. I need to replicate each one of these rows based on the number of days between the start date and the max of the end date and the approval date.

ID <- c(1,2)

Value <- c(10,20)

StartDate <- c(as.Date("01/01/2015", '%d/%m/%Y'),
    as.Date("01/01/2015", '%d/%m/%Y'))

EndDate <- c(as.Date("31/01/2015", '%d/%m/%Y'), 
    as.Date("15/01/2015", '%d/%m/%Y'))

AppDate <- c(as.Date("15/01/2015", '%d/%m/%Y'), 
    as.Date("15/02/2015", '%d/%m/%Y'))

df <- data.frame(ID, Value, StartDate, EndDate, AppDate)

df <- df[rep(row.names(df), ifelse(as.numeric(df$AppDate) >
    as.numeric(df$EndDate),as.numeric(df$AppDate-df$StartDate),
    as.numeric(df$EndDate-df$StartDate)) + 1),]

I then need to add a sequential list of dates from the start date to the max of the end date or approval date.

I've done this via 2 loops. The outer loop loops through the data frame for each unique ID. The second loop then goes through the ID and adds the date. Once the second loop has finished it passes the row to the outer loop as the new start point.

IDs <- unique(df$ID)
# Placeholder dates, overwritten row by row below
df$Days <- rep(as.Date("01/01/1999", '%d/%m/%Y'), nrow(df))
counter <- 1
for (i in seq_along(IDs)) {
    ref <- IDs[i]
    start <- 1
    # Walk the rows of the current ID; stop at the end of the data frame
    while (counter <= nrow(df) && df$ID[counter] == ref) {
        # start - 1 is 0 on the first row, so this covers the start date too
        df$Days[counter] <- df$StartDate[counter] + start - 1
        counter <- counter + 1
        start <- start + 1
    }
}

My actual data set has over 6,000 IDs, and once I've replicated the rows it ends up with over 500,000 rows. The loop took over 15 minutes to run, so it's obviously very inefficient.

So I guess I have 2 questions.

1) What is the most efficient way to do this in R?

2) What would be the most efficient way of doing this in general, e.g. in something like C++?

thanks


Solution

  • Here is one vectorized solution. Note: your loop code does not appear to take the maximum of EndDate and AppDate, which is what I tried to do here. If that is not what you want, you can modify the code accordingly.

    library(dplyr)
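    # Assumes df has already been expanded to one row per day, as in the question,
    # so each ID's date sequence has exactly as many elements as the group has rows.
    df <- df %>% group_by(ID) %>% mutate(Days = seq(min(StartDate), max(EndDate, AppDate), by = 'days'))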
    

    Output will be as follows (just the first few rows):

    head(df)
    Source: local data frame [6 x 6]
    Groups: ID [1]
    
         ID Value  StartDate    EndDate    AppDate       Days
      (dbl) (dbl)     (date)     (date)     (date)     (date)
    1     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-01
    2     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-02
    3     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-03
    4     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-04
    5     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-05
    6     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-06
    
    tail(df)
    Source: local data frame [6 x 6]
    Groups: ID [1]
    
         ID Value  StartDate    EndDate    AppDate       Days
      (dbl) (dbl)     (date)     (date)     (date)     (date)
    1     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-10
    2     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-11
    3     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-12
    4     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-13
    5     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-14
    6     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-15
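
For comparison, the same result can be sketched in base R without dplyr and without an explicit loop. This is a minimal sketch rather than the answerer's method: it starts from the original two-row data frame (rebuilt here so the block runs on its own), and the names `n_days` and `out` are illustrative. `pmax()` gives the later of EndDate and AppDate for each row, and `sequence()` produces the within-ID day counter 1, 2, ..., n.

```r
# Rebuild the example data from the question (before any row replication).
df <- data.frame(
  ID        = c(1, 2),
  Value     = c(10, 20),
  StartDate = as.Date(c("2015-01-01", "2015-01-01")),
  EndDate   = as.Date(c("2015-01-31", "2015-01-15")),
  AppDate   = as.Date(c("2015-01-15", "2015-02-15"))
)

# Days per ID: from StartDate to the later of EndDate and AppDate, inclusive.
n_days <- as.integer(pmax(df$EndDate, df$AppDate) - df$StartDate) + 1L

# Repeat each row n_days times, then add a running date within each ID.
out <- df[rep(seq_len(nrow(df)), n_days), ]
out$Days <- out$StartDate + sequence(n_days) - 1L
rownames(out) <- NULL
```

On the example data this gives 31 rows for ID 1 and 46 rows for ID 2. Both steps operate on whole vectors, so the work should scale with the number of output rows rather than with repeated single-element assignments into the data frame.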