Search code examples
rperformancefor-looptimevectorization

I need a faster approach for comparing values from two dataframes than a for loop


I'm having the problem that I want to iterate or get the specific data from the rows of one data frame to then obtain some results from another data frame.

One way to do what I want is with the following code. The problem is that in the actual data, the first data frame has around 2 million rows, and the second one has around 8 million rows. So, It takes way too much time for this for loop to end.

store_info <- numeric(nrow(data_frame_1))
start_time = Sys.time()
for (i in seq(1:dim(data_frame_1)[1]) ) {
  
  users_id = data_frame_1$user_id1[i]
  date = data_frame_1$date_1[i]
  
  store_info[i] = length(data_frame_2[data_frame_2$user_id2 == users_id &
                           data_frame_2$date_2 < date, "job_id"])
  
}
end_time = Sys.time()
(diff_time = end_time - start_time)

Because of that problem with the time it takes to run this code, I wonder if anyone here knows a better way (a faster way) to solve this problem. With some vectorization for example, or maybe making a join (I tried this, but the resulting table is too big, but there might be a better way). Thanks in advance!!

Here is some example data:

start_date <- as.Date("2021-01-01")
end_date <- as.Date("2022-12-31")
dates_pull = seq(start_date, end_date, by = "day")
random_dates <- sample(dates_pull, 80000, replace = TRUE)

# generate random IDs for users
possible_chars <- c(letters, LETTERS, 0:9)
num_ids <- 800 # number of IDs to generate
id_length <- 8 # length of each ID
random_ids <- replicate(num_ids, paste0(sample(possible_chars, 
                                id_length, replace = TRUE), collapse = ""))
sample_ids <- sample(random_ids, 80000, replace = TRUE)

data_frame_1 = data.frame(user_id1 = sample_ids, date_1 = random_dates)

# generate random IDs for jobs
random_ids_j <- replicate(num_ids, paste0(sample(possible_chars, id_length, replace = TRUE), collapse = ""))
sample_ids_j <- sample(random_ids, 500000, replace = TRUE)

random_dates <- sample(dates_pull, 500000, replace = TRUE)
sample_ids <- sample(random_ids, 500000, replace = TRUE)

data_frame_2 = data.frame(user_id2 = sample_ids, date_2 = random_dates,
                          job_id = sample_ids_j)

I tried the next code, but it is way too slow.

store_info <- numeric(nrow(data_frame_1))
start_time = Sys.time()
for (i in seq(1:dim(data_frame_1)[1]) ) {
  
  users_id = data_frame_1$user_id1[i]
  date = data_frame_1$date_1[i]
  
  store_info[i] = length(data_frame_2[data_frame_2$user_id2 == users_id &
                           data_frame_2$date_2 < date, "job_id"])
  
}
end_time = Sys.time()
(diff_time = end_time - start_time)

Solution

  • We could do a non-equi join

    library(data.table)
    out <- setDT(data_frame_2)[data_frame_1, .(.N), 
       on =  .(user_id2 = user_id1, date_2 < date_1), by = .EACHI]