Optimizing For loop with nested if in R


I am trying to merge multiple CSV files into a single data frame and then manipulate the resulting data frame with a for loop. The resulting data frame may have between 1,500,000 and 2,000,000 rows.

I am using the code below.

setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path) 
{ 
  files = dir(path, pattern = '\\.csv', full.names = TRUE)
  tables = lapply(files, read_csv)
  do.call(rbind, tables)
}


Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)

for (i in 1:nrow(Data1))
{ 
  Data1$Category[i] = ""
  Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
                       ifelse(Data1$Days[i] <= 60, "31-60",
                       ifelse(Data1$Days[i] <= 90, "61-90",">90")))     

}

However, the code takes a very long time to run. Is there a better and faster way to do the same operation?


Solution

  • We can speed this up by reading the files with fread from data.table and replacing the row-by-row loop with a single vectorized call to cut (or findInterval). fread is multithreaded, so the gain in read time is larger on machines with more cores. Note that cut returns a factor; wrap it in as.character() if a character column is needed.

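    library(data.table)

    merge_data <- function(path) {
      files <- dir(path, pattern = '\\.csv$', full.names = TRUE)
      # read only the three needed columns from each file, then stack them
      rbindlist(lapply(files, fread, select = c(8, 9, 17)))
    }

    Data <- merge_data("D:/Projects")
    # bin Days in one vectorized call; intervals are right-closed, so 30 falls in "<=30"
    Data[, Category := cut(Days, breaks = c(-Inf, 30, 60, 90, Inf),
                           labels = c("<=30", "31-60", "61-90", ">90"))]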
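
The answer mentions findInterval as an alternative to cut but does not show it. A minimal sketch of that variant follows, using base R only. The helper name categorize_days and the sample vector are made up for illustration, and the bucket labels are assumed to be the same as in the cut call above.

```r
# Hypothetical helper: map day counts to the four buckets from the question.
categorize_days <- function(days) {
  labels <- c("<=30", "31-60", "61-90", ">90")
  # left.open = TRUE makes each interval right-closed, so 30 lands in "<=30"
  # and 31 in "31-60", matching the nested ifelse() conditions.
  idx <- findInterval(days, c(30, 60, 90), left.open = TRUE) + 1L
  labels[idx]  # NA input stays NA
}

# Made-up sample values covering every boundary
days <- c(5, 30, 31, 60, 61, 90, 91, NA)
categories <- categorize_days(days)
print(data.frame(Days = days, Category = categories))
```

With the data.table from the answer, this could be applied as Data[, Category := categorize_days(Days)], assuming the column is named Days as in the question. Unlike cut, this returns a character vector rather than a factor.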