Search code examples
rperformancefor-looplapply

Running a for loop in R on 60 million rows of data - is there a faster way?


I'm currently running a for loop in R that is attempting to create unique IDs for "trips" in a dataset, but sometimes a trip is split into two or more rows in the dataset. The for loop contains the logic that is needed to define whether a trip IS solely represented by one row or multiple rows. This means that the for loop needs to look at the current row it is assessing AND the previous row, assess a conditional statement, then apply the same ID as the previous row or increment the value by 1 to create a new ID.

I also had to add a specific check into the for loop that defaulted the first row to the value of 1 because there is no previous row to compare to and I was getting an error without this addition.

This worked well on a test dataset of 100 rows and gives me the output I want, but now I am running it on 60 million rows and its so far been running for over 24 hours. Is there a way to speed up this process and make it faster? I could not figure out how to use lapply or other apply functions with this complex of a conditional but maybe am misunderstanding how those functions work. I'm also not sure that would speed up the processing. Another option might be to split the table into separate tables and run my script in chunks, but it feels like the script should only be looking at each row and the row before it, so I'm not sure that would actually save time (but would provide the fail safe of intermediate save points in case my server crashes, etc.).

Any improvements to this would be welcome!

This is the current script:

Oct19Subset <- Oct19%>%
  arrange(media_uuid, `Entry Date`) %>% 
  mutate(prev_mediaUUID = c(0, head(media_uuid, -1)))%>% 
  mutate(linked_trip_id = 1)

for (i in 1:nrow(Oct19Subset)){
  if (i == 1) {
    Oct19Subset$linked_trip_id[i] <- 1
  } else {
    if (Oct19Subset$`Ride Type`[i] == "B" | 
         Oct19Subset$media_uuid[i] != Oct19Subset$prev_mediaUUID[i]){
      Oct19Subset$linked_trip_id[i] <- Oct19Subset$linked_trip_id[i-1] + 1
    } else Oct19Subset$linked_trip_id[i] <- Oct19Subset$linked_trip_id[i-1]
  }
} 

Solution

  • Try this:

    system.time(
      Oct19Subset <- Oct19%>%
        arrange(media_uuid, `Entry Date`) %>% 
        mutate(prev_mediaUUID = c(media_uuid[1] - 1, head(media_uuid, -1)))%>% 
        mutate(linked_trip_id = cumsum(`Ride Type` == "B" | media_uuid != prev_mediaUUID))
    )
    #>    user  system elapsed 
    #>    0.74    0.14    0.87
    

    Data:

    library(dplyr)
    
    N <- 6e6
    
    Oct19 <- setNames(
      data.frame(
       sample(1e5, N, 1),
        as.Date(sample(19000:19600, N, 1)),
        sample(LETTERS, N, 1)
      ), c("media_uuid", "Entry Date", "Ride Type")
    )
    

    Or you can skip the helper column if it isn't needed elsewhere:

    Oct19Subset <- Oct19%>%
      arrange(media_uuid, `Entry Date`) %>% 
      mutate(
        linked_trip_id = cumsum(
          `Ride Type` == "B" |
            media_uuid != c(media_uuid[1] - 1, head(media_uuid, -1))
        )
      )