r procedural-programming declarative-programming

More efficient version of this R loop


I'm used to Python and JS, and pretty new to R, but enjoying it for data analysis. I was looking to create a new field in my data frame, based on some if/else logic, and tried to do it in a standard/procedural way:

for (i in 1:nrow(df)) {
  if (is.na(df$First_Payment_date[i]) == TRUE) {
    df$User_status[i] = "User never paid"
  } else if (df$Payment_Date[i] >= df$First_Payment_date[i]) {
    df$User_status[i] = "Paying user"
  } else if (df$Payment_Date[i] < df$First_Payment_date[i]) {
    df$User_status[i] = "Attempt before first payment"
  } else {
    df$User_status[i] = "Error"
  }
}

But it was CRAZY slow. I tried running this on a data frame of ~3 million rows, and it took way, way too long. Any tips on the "R" way of doing this?

Note that the df$Payment_Date and df$First_Payment_date fields are formatted as dates.


Solution

  • I benchmarked data.frame against data.table on a relatively large dataset.

    First we generate some data.

    set.seed(1234)
    library(data.table)
    df = data.frame(First_Payment_date=sample(c(NA,1:100), 1000000, replace=TRUE),
                     Payment_Date=sample(1:100, 1000000, replace=TRUE))
    dt = data.table(df)
    

    Then set up the benchmark. I compare @BondedDust's answer with its data.table equivalent. I have slightly modified (debugged) his code.

    library(microbenchmark)
    
    # Base R: logical-index assignment; works on a function-local copy of df
    test_df = function(){
        df$User_status <- "Error"
        df$User_status[ is.na(df$First_Payment_date) ] <- "User never paid"
        df$User_status[ df$Payment_Date >= df$First_Payment_date ] <- "Paying user"
        df$User_status[ df$Payment_Date < df$First_Payment_date ] <- "Attempt before first payment"
    }
    
    # data.table: := updates by reference, so the global dt itself is modified
    test_dt = function(){
        dt[, User_status := "Error"]
        dt[is.na(First_Payment_date), User_status := "User never paid"]
        dt[Payment_Date >= First_Payment_date, User_status := "Paying user"]
        dt[Payment_Date < First_Payment_date, User_status := "Attempt before first payment"]
    }
    
    microbenchmark(test_df(), test_dt(), times=10)
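
    As a minimal sketch of why that sequence of assignments handles the NA rows, here is the same logic on a three-row data frame (the name small and its values are made up for illustration). Comparisons involving NA return NA, and R skips NA positions in a logical index when the replacement value has length one, so the "User never paid" rows are left alone by the later assignments.

    ```r
    # Tiny hypothetical data frame (values invented for illustration)
    small <- data.frame(
      First_Payment_date = c(NA, 5, 10),
      Payment_Date       = c(3, 7, 2)
    )

    # Start from the fallback label, then overwrite case by case
    small$User_status <- "Error"
    small$User_status[is.na(small$First_Payment_date)] <- "User never paid"

    # Row 1 compares against NA, so its index is NA and it is not overwritten
    small$User_status[small$Payment_Date >= small$First_Payment_date] <- "Paying user"
    small$User_status[small$Payment_Date < small$First_Payment_date] <- "Attempt before first payment"

    print(small)
    ```

    Row 1 should keep "User never paid", row 2 becomes "Paying user" and row 3 becomes "Attempt before first payment".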
    

    The result: on the generated 1-million-row data, data.table is about 4x faster than data.frame (median of roughly 70 ms vs. 288 ms).

    > microbenchmark(test_df(), test_dt(), times=10)
    Unit: milliseconds
          expr       min        lq    median        uq       max neval
     test_df() 247.29182 256.69067 287.89768 319.34873 330.33915    10
     test_dt()  66.74265  69.42574  70.27826  72.93969  80.89847    10
    

    Note

    data.frame is faster than data.table for small datasets (say, 10,000 rows).
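
    The benchmark uses integers in place of dates, while the question's columns are Dates. Here is a small check, under assumed sample data (chk, loop_status and vec_status are hypothetical names, and the dates are invented), that the vectorized assignments give the same labels as the original loop on real Date columns. It is only a sanity check on 1,000 rows, not a timing.

    ```r
    set.seed(42)
    n <- 1000

    # Hypothetical sample with Date columns; some first-payment dates are NA
    chk <- data.frame(
      First_Payment_date = as.Date("2020-01-01") + sample(c(NA, 0:99), n, replace = TRUE),
      Payment_Date       = as.Date("2020-01-01") + sample(0:99, n, replace = TRUE)
    )

    # Row-by-row version, following the loop from the question
    loop_status <- character(n)
    for (i in seq_len(n)) {
      if (is.na(chk$First_Payment_date[i])) {
        loop_status[i] <- "User never paid"
      } else if (chk$Payment_Date[i] >= chk$First_Payment_date[i]) {
        loop_status[i] <- "Paying user"
      } else if (chk$Payment_Date[i] < chk$First_Payment_date[i]) {
        loop_status[i] <- "Attempt before first payment"
      } else {
        loop_status[i] <- "Error"
      }
    }

    # Vectorized version, same order of assignments as in test_df()
    vec_status <- rep("Error", n)
    vec_status[is.na(chk$First_Payment_date)] <- "User never paid"
    vec_status[chk$Payment_Date >= chk$First_Payment_date] <- "Paying user"
    vec_status[chk$Payment_Date < chk$First_Payment_date] <- "Attempt before first payment"

    # Stops with an error if the two approaches ever disagree
    stopifnot(identical(loop_status, vec_status))
    table(vec_status)
    ```

    Note that the sample has no missing Payment_Date values; the loop from the question would stop with an error on such a row, whereas the vectorized version would leave it labeled "Error".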