As you can see from my username I am a complete beginner. I have searched for the answer to this question however I am struggling to find a working example that does anything similar to what I need. This code currently works however it looks like it might take over 64 hours to run.
This code is to help me identify how many active contracts there are on any given day over a 5 year period. my process at the moment is to loop through everyday between 2015-01-01 to 2020-10-05, and then loop through every row of data.frame to see if the start and end dates of the contract fall within the date being looped through.
big_data_new <- read.table(header=TRUE, text="
end_date start_date start_date_N end_date_N
1: 2017-03-16 2016-03-16 16876 17241
2: 2015-07-31 2015-07-08 16624 16647
3: 2016-08-02 2016-07-06 16988 17015
4: 2017-07-18 2017-05-31 17317 17365
5: 2016-10-28 2016-01-07 16807 17102
6: 2017-08-01 2017-06-29 17346 17379")
library(data.table)
library(lubridate)
library(dplyr)
library(tidyverse)
big_data_new <- big_data_new2
big_data_new$start_date <- lubridate:: parse_date_time(big_data_new$start_date, orders = "d-b-Y")
big_data_new$end_date <- lubridate:: parse_date_time(big_data_new$end_date, orders = "d-b-Y")
big_data_new = select(big_data_new, end_date, start_date)
big_data_new$start_date <- as.Date(big_data_new$start_date, "%d-%M-%Y")
big_data_new$end_date <- as.Date(big_data_new$end_date, "%d-%M-%Y")
head(big_data_new)
StartDate<-"2015-01-01"
EndDate<- "2020-10-05"
dates<-seq(as.POSIXct(StartDate, format="%Y-%m-%d"), as.POSIXct(EndDate, format="%Y-%m-%d"), by='days')
df = NULL
for (f in as.list(dates)){
Between22 <- as.Date(f, format="%Y-%m-%d")
class(Between22)
count=0
for(i in 1:nrow(big_data)){
date_start <- as.Date(big_data$start_date_N[i], origin="1970-01-01")
date_end <- as.Date(big_data$end_date_N[i], origin="1970-01-01")
if (between(Between22, date_start, date_end) == TRUE){
count=count +1
}
}
df = rbind(df, data.frame(Between22, count))
}
library(tibble)
library(dplyr)
library(lubridate)
big_data_new <- tibble::tribble(
~end_date, ~start_date, ~start_date_N, ~end_date_N,
"2017-03-16", "2016-03-16", 16876, 17241,
"2015-07-31", "2015-07-08", 16624, 16647,
"2016-08-02", "2016-07-06", 16988, 17015,
"2017-07-18", "2017-05-31", 17317, 17365,
"2016-10-28", "2016-01-07", 16807, 17102,
"2017-08-01", "2017-06-29", 17346, 17379)
## what are min starting and max ending date
date_range_min <- min(big_data_new$start_date_N)
date_range_max <- max(big_data_new$end_date_N)
## this definese the range of dates we are going through
date_vector <- date_range_min:date_range_max
## create storage object which is large enough to contain all results !!
result_vector <- vector(mode = "integer", length = length(date_vector))
for (d in seq_along(date_vector)) {
## logical tests: is date between any starting or ending date?
## if so, each row for which this is true will return TRUE
## sum(TRUE) equals 1, 0 otherwise
result_vector[[d]] <-
sum(big_data_new$start_date_N <= date_vector[[d]] & big_data_new$end_date_N >= date_vector[[d]])
}
result_data_frame <- tibble::tibble(date = date_vector, number_of_contracts = result_vector)
result_data_frame %>% dplyr::mutate(date = lubridate::as_date(date))
One problem is that you loop through the rows of your dataframe which is unnecessary in this case. Another aspect is that you create a result object that is large enough to contain all the results. If you have a (very) large data set and you keep on 'growing' your result object this will lead to lower speed. I can highly recommend Hands-on Programming in R where both points are illustrated. By the way: I could not properly execute your approach without changing variable names. Also the df data frame only had count = 0 as results when I tried.
Edit: When I managed to execute your code it took
Time difference of 17.04095 secs
to execute (without loading packages).
When I executed my solution it took:
Time difference of 0.916553 secs
without loading packages (with same time range as you).
Hopefully this will provide the speed increase that you asked for. If so, I was very glad to help!