I have a table includes id, id_create_date and id_closed_date. Assume id_create_date starts from 2021-01-01 to today. In some of the days there is no id creation or id closing.
ID | id_create_date | id_closed_date |
---|---|---|
1 | 2021-03-01 | 2021-03-01 |
2 | 2021-03-02 | NA |
3 | 2021-03-04 | 2021-04-11 |
4 | 2021-03-05 | 2021-03-22 |
I would like to have a summary table to count how many id created, how many id closed, how many id still open day by day from the beginning of the year to today.
Date | number of opened | number of closed | number of still open |
---|---|---|---|
2021-03-01 | 1 | 1 | 0 |
2021-03-02 | 1 | 0 | 1 |
2021-03-03 | 0 | 0 | 1 |
2021-03-04 | 1 | 0 | 2 |
2021-03-05 | 1 | 0 | 3 |
2021-03-06 | . | . | . |
. | . | . | . |
. | . | . | . |
2021-04-11 | 1 | 0 | 1 |
To obtain desired table I created calendar table and I tried to join it with the main table. There is a work around using cross join in sql. But I am looking for a way to do this with data.table or dplyr or something different.
The data size huge so I used data.table to manipulate the data. However I cannot create the way how can count the ids based on calendar date with inequity join.
I tried many thing but could not find the solution.
test3 <- test1[test2, on = .(calender_date>= id_created_at),allow.cartesian=TRUE, nomatch = 0]
Only way that I found is loop. But table is too big so the loop does not come to end. The below code is loop version.
newtable=data.frame(matrix(ncol = 4, nrow = 0),stringsAsFactors = FALSE)
start.time <- Sys.time()
i=1
while(i<=length(calendar[calendar<=Sys.Date()])){
numberofids=length(unique(
inbouds$id_id[which(inbouds$id_created_at==calendar[i])] ))
numberofclosedids=length(unique(
inbouds$id_id[which(inbouds$id_closed_at==calendar[i])] ))
numberofPendingids=length(unique(
inbouds$id_id[which(inbouds$id_created_at<calendar[i] &
(is.na(inbouds$id_closed_at) | inbouds$id_closed_at>=calendar[i]))] ))
subdata=cbind(as.character(calendar[i]),as.numeric(numberofids),as.numeric(numberofclosedids),as.numeric(numberofPendingids))
subdata= as.data.frame(subdata,stringsAsFactors = FALSE)
newtable=rbind(newtable,subdata)
i=i+1
}
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
newtable=setNames(newtable, c("date", "#ofidOpened", "#ofidClosed", "#ofidPending"))
newtable$date= as.character(newtable$date)
newtable$`#ofidOpened`= as.numeric(newtable$`#ofidOpened`)
newtable$`#ofidClosed`= as.numeric(newtable$`#ofidClosed`)
newtable$`#ofidPending`= as.numeric(newtable$`#ofidPending`)
Here is an option using non-equi join:
DT[is.na(id_closed_date), id_closed_date := as.IDate("9999-12-31")]
d <- seq(as.IDate("2021-03-01"), as.IDate("2021-04-11"), by="1 day")
DT[.(d=d), on=.(id_create_date<=d, id_closed_date>=d), by=.EACHI,
{
nc <- sum(x.id_closed_date==i.d)
.(num_open=sum(x.id_create_date==i.d),
num_closed=nc,
num_active=.N - nc)
}]
output:
id_create_date id_closed_date num_open num_closed num_active
1: 2021-03-01 2021-03-01 1 1 0
2: 2021-03-02 2021-03-02 1 0 1
3: 2021-03-03 2021-03-03 0 0 1
4: 2021-03-04 2021-03-04 1 0 2
5: 2021-03-05 2021-03-05 1 0 3
6: 2021-03-06 2021-03-06 0 0 3
7: 2021-03-07 2021-03-07 0 0 3
8: 2021-03-08 2021-03-08 0 0 3
9: 2021-03-09 2021-03-09 0 0 3
10: 2021-03-10 2021-03-10 0 0 3
11: 2021-03-11 2021-03-11 0 0 3
12: 2021-03-12 2021-03-12 0 0 3
13: 2021-03-13 2021-03-13 0 0 3
14: 2021-03-14 2021-03-14 0 0 3
15: 2021-03-15 2021-03-15 0 0 3
16: 2021-03-16 2021-03-16 0 0 3
17: 2021-03-17 2021-03-17 0 0 3
18: 2021-03-18 2021-03-18 0 0 3
19: 2021-03-19 2021-03-19 0 0 3
20: 2021-03-20 2021-03-20 0 0 3
21: 2021-03-21 2021-03-21 0 0 3
22: 2021-03-22 2021-03-22 0 1 2
23: 2021-03-23 2021-03-23 0 0 2
24: 2021-03-24 2021-03-24 0 0 2
25: 2021-03-25 2021-03-25 0 0 2
26: 2021-03-26 2021-03-26 0 0 2
27: 2021-03-27 2021-03-27 0 0 2
28: 2021-03-28 2021-03-28 0 0 2
29: 2021-03-29 2021-03-29 0 0 2
30: 2021-03-30 2021-03-30 0 0 2
31: 2021-03-31 2021-03-31 0 0 2
32: 2021-04-01 2021-04-01 0 0 2
33: 2021-04-02 2021-04-02 0 0 2
34: 2021-04-03 2021-04-03 0 0 2
35: 2021-04-04 2021-04-04 0 0 2
36: 2021-04-05 2021-04-05 0 0 2
37: 2021-04-06 2021-04-06 0 0 2
38: 2021-04-07 2021-04-07 0 0 2
39: 2021-04-08 2021-04-08 0 0 2
40: 2021-04-09 2021-04-09 0 0 2
41: 2021-04-10 2021-04-10 0 0 2
42: 2021-04-11 2021-04-11 0 1 1
id_create_date id_closed_date num_open num_closed num_active
data:
library(data.table)
DT <- fread("ID id_create_date id_closed_date
1 2021-03-01 2021-03-01
2 2021-03-02 NA
3 2021-03-04 2021-04-11
4 2021-03-05 2021-03-22")
cols <- c("id_create_date", "id_closed_date")
DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]