Search code examples
rfor-looptimevectorizationexecution

Rewrite a for-loop that takes forever


I wrote a piece of code in R that calculates the cumulative sum of some data. It works. Problem is, I have 25,000 numbers X 12 months that I need to "melt", so I end up with 300,000 rows (and every month there will be about 2000x12 more). The first six lines are to recreate a sample of my table (a huge excel file). Then there is some magic done to convert things into the right formats, and in the end I have this double for-loop that calculates the cumulative sum for every month based on wether it's a double "PDRcount" or not. The loop takes 6 hrs when I try it on my real data... How can I do this faster?

library(reshape2)

PDR <- (c( 1,2,3,4,5,2))
START <-  as.Date(c("2008-01-01","2007-01-01","2010-01-01","2011-01-01","2017-02-01","2017-03-01"))
SWITCHOUT <- as.Date(c(NA, "2017-02-28", NA, NA, "2017-03-31",NA))
JAN17 <- (c(100,124,165,178,0,0))
FEB17 <- (c(101,125,133,178,170,0))
MAR17 <- (c(99,0,165,180,166,99))
APR17 <- (c(100,0,156,178,0,78))

alldata <- data.frame(PDR=PDR,
                  START=START,
                  SWITCHOUT=SWITCHOUT,
                  JAN17=JAN17,
                  FEB17=FEB17,
                  MAR17=MAR17,
                  APR17=APR17)

## count PDR occurrences    
alldata$PDRcount <- ave(alldata$PDR,alldata$PDR,FUN=length)
alldata$PDRcount <- as.numeric(alldata$PDRcount)

crossdata<-melt(alldata,id=(c("PDR", "START","SWITCHOUT","PDRcount" )))
colnames(crossdata) <- c("PDR","START","SWITCHOUT","PDRcount","MONTH","SMC")

## transform levels to date format
levels(crossdata$MONTH)[1] <- "2017-01-01"
levels(crossdata$MONTH)[2] <- "2017-02-01"
levels(crossdata$MONTH)[3] <- "2017-03-01"
levels(crossdata$MONTH)[4] <- "2017-04-01"
crossdata$MONTH <- as.Date(crossdata$MONTH,format = "%Y-%m-%d" )


for (pdr in crossdata[,"PDR"]){

maxPDR <- max(crossdata$PDRcount[crossdata$PDR == pdr])
dates <- unique(crossdata$START[crossdata$PDR == pdr])

for (i in 1:maxPDR) {

CumSum <- cumsum( crossdata$SMC[crossdata$PDR == pdr & crossdata$START == dates[i]] )

    crossdata$SMCcum[crossdata$PDR == pdr & crossdata$START == dates[i] & crossdata$MONTH == "2017-01-01"] <- CumSum[1]
    crossdata$SMCcum[crossdata$PDR == pdr & crossdata$START == dates[i] & crossdata$MONTH == "2017-02-01"] <- CumSum[2]
    crossdata$SMCcum[crossdata$PDR == pdr & crossdata$START == dates[i]  & crossdata$MONTH == "2017-03-01"] <- CumSum[3]
    crossdata$SMCcum[crossdata$PDR == pdr & crossdata$START == dates[i] & crossdata$MONTH == "2017-04-01"] <- CumSum[4]     
}
}

edited: sorry there was an error...


Solution

  • This is a partial answer. I do not understand the part "...based on whether it's a double "PDRcount" or not."

    Here a partial answer for case where PDR !=2 using the dplyr library. I also simplified the data input by using dput on your crossdata variable prior to any calculation.

    crossdata1<-structure(list(PDR = c(1, 2, 3, 4, 5, 2, 1, 2, 3, 4, 5, 2, 1, 
                                       2, 3, 4, 5, 2, 1, 2, 3, 4, 5, 2),
                               START = structure(c(13879, 13514, 14610, 14975, 17198, 17226, 13879, 13514, 14610, 14975, 
                                        17198, 17226, 13879, 13514, 14610, 14975, 17198, 17226, 13879, 
                                        13514, 14610, 14975, 17198, 17226), class = "Date"), 
                               SWITCHOUT = structure(c(NA, 17225, NA, NA, 17256, NA, NA, 17225, NA, NA, 17256, NA, NA, 17225, 
                                          NA, NA, 17256, NA, NA, 17225, NA, NA, 17256, NA), class = "Date"), 
                               PDRcount = c(1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2), 
                               MONTH = structure(c(17167, 17167, 
                                         17167, 17167, 17167, 17167, 17198, 17198, 17198, 17198, 17198, 
                                         17198, 17226, 17226, 17226, 17226, 17226, 17226, 17257, 17257, 
                                         17257, 17257, 17257, 17257), class = "Date"), 
                               SMC = c(100, 124, 165, 178, 0, 0, 101, 125, 133, 178, 170, 0, 99, 0, 165, 
                                         180, 166, 99, 100, 0, 156, 178, 0, 78)), 
                          row.names = c(NA,  -24L), .Names = c("PDR", "START", "SWITCHOUT", "PDRcount", "MONTH", "SMC"),
                          class = "data.frame")   
    
    #test to see if starting data is the same
    identical(crossdata, crossdata1)
    library(dplyr)
    
    #group by and add the cumsum column to answer dataframe
    ans<-group_by(crossdata1, PDR) %>%
      mutate(SMCcum = cumsum(SMC))
    
    #rows where the 2 final dataframes do not match
    crossdata[-which(crossdata$SMCcum== ans$SMCcum),]
    

    Most likely the above lines could work if an additional filter is applied to remove cases where '...double "PDRcount" or not.' applies.

    I found this post helpful: cumsum in grouped data with dplyr

    Good luck.