I'm having the problem that I want to iterate or get the specific data from the rows of one data frame to then obtain some results from another data frame.
One way to do what I want is with the following code. The problem is that in the actual data, the first data frame has around 2 million rows, and the second one has around 8 million rows. So, It takes way too much time for this for
loop to end.
store_info <- numeric(nrow(data_frame_1))
start_time = Sys.time()
for (i in seq(1:dim(data_frame_1)[1]) ) {
users_id = data_frame_1$user_id1[i]
date = data_frame_1$date_1[i]
store_info[i] = length(data_frame_2[data_frame_2$user_id2 == users_id &
data_frame_2$date_2 < date, "job_id"])
}
end_time = Sys.time()
(diff_time = end_time - start_time)
Because of that problem with the time it takes to run this code, I wonder if anyone here knows a better way (a faster way) to solve this problem. With some vectorization for example, or maybe making a join (I tried this, but the resulting table is too big, but there might be a better way). Thanks in advance!!
Here is some example data:
start_date <- as.Date("2021-01-01")
end_date <- as.Date("2022-12-31")
dates_pull = seq(start_date, end_date, by = "day")
random_dates <- sample(dates_pull, 80000, replace = TRUE)
# generate random IDs for users
possible_chars <- c(letters, LETTERS, 0:9)
num_ids <- 800 # number of IDs to generate
id_length <- 8 # length of each ID
random_ids <- replicate(num_ids, paste0(sample(possible_chars,
id_length, replace = TRUE), collapse = ""))
sample_ids <- sample(random_ids, 80000, replace = TRUE)
data_frame_1 = data.frame(user_id1 = sample_ids, date_1 = random_dates)
# generate random IDs for jobs
random_ids_j <- replicate(num_ids, paste0(sample(possible_chars, id_length, replace = TRUE), collapse = ""))
sample_ids_j <- sample(random_ids, 500000, replace = TRUE)
random_dates <- sample(dates_pull, 500000, replace = TRUE)
sample_ids <- sample(random_ids, 500000, replace = TRUE)
data_frame_2 = data.frame(user_id2 = sample_ids, date_2 = random_dates,
job_id = sample_ids_j)
I tried the next code, but it is way too slow.
store_info <- numeric(nrow(data_frame_1))
start_time = Sys.time()
for (i in seq(1:dim(data_frame_1)[1]) ) {
users_id = data_frame_1$user_id1[i]
date = data_frame_1$date_1[i]
store_info[i] = length(data_frame_2[data_frame_2$user_id2 == users_id &
data_frame_2$date_2 < date, "job_id"])
}
end_time = Sys.time()
(diff_time = end_time - start_time)
We could do a non-equi join
library(data.table)
out <- setDT(data_frame_2)[data_frame_1, .(.N),
on = .(user_id2 = user_id1, date_2 < date_1), by = .EACHI]