Search code examples
rjoindata.tablecross-joininequality

R creating a summary table based on date of another table with inequity join


I have a table includes id, id_create_date and id_closed_date. Assume id_create_date starts from 2021-01-01 to today. In some of the days there is no id creation or id closing.

ID id_create_date id_closed_date
1 2021-03-01 2021-03-01
2 2021-03-02 NA
3 2021-03-04 2021-04-11
4 2021-03-05 2021-03-22

I would like to have a summary table to count how many id created, how many id closed, how many id still open day by day from the beginning of the year to today.

Date number of opened number of closed number of still open
2021-03-01 1 1 0
2021-03-02 1 0 1
2021-03-03 0 0 1
2021-03-04 1 0 2
2021-03-05 1 0 3
2021-03-06 . . .
. . . .
. . . .
2021-04-11 1 0 1

To obtain desired table I created calendar table and I tried to join it with the main table. There is a work around using cross join in sql. But I am looking for a way to do this with data.table or dplyr or something different.

The data size huge so I used data.table to manipulate the data. However I cannot create the way how can count the ids based on calendar date with inequity join.

I tried many thing but could not find the solution.

test3 <- test1[test2, on = .(calender_date>= id_created_at),allow.cartesian=TRUE, nomatch = 0]

Only way that I found is loop. But table is too big so the loop does not come to end. The below code is loop version.

  newtable=data.frame(matrix(ncol = 4, nrow = 0),stringsAsFactors = FALSE)
  
  start.time <- Sys.time()
  i=1
  while(i<=length(calendar[calendar<=Sys.Date()])){ 
    
    numberofids=length(unique(
      inbouds$id_id[which(inbouds$id_created_at==calendar[i])] ))
    
    numberofclosedids=length(unique(
      inbouds$id_id[which(inbouds$id_closed_at==calendar[i])] ))
    
    numberofPendingids=length(unique(
      inbouds$id_id[which(inbouds$id_created_at<calendar[i] & 
                                (is.na(inbouds$id_closed_at) | inbouds$id_closed_at>=calendar[i]))] ))
    
    
    subdata=cbind(as.character(calendar[i]),as.numeric(numberofids),as.numeric(numberofclosedids),as.numeric(numberofPendingids))
    subdata= as.data.frame(subdata,stringsAsFactors = FALSE)
    
    newtable=rbind(newtable,subdata)
    
    i=i+1
    
  }  
  
  end.time <- Sys.time()
  time.taken <- end.time - start.time
  time.taken
  
  newtable=setNames(newtable, c("date", "#ofidOpened", "#ofidClosed", "#ofidPending"))
  
  newtable$date= as.character(newtable$date)
  newtable$`#ofidOpened`= as.numeric(newtable$`#ofidOpened`)
  newtable$`#ofidClosed`= as.numeric(newtable$`#ofidClosed`)
  newtable$`#ofidPending`= as.numeric(newtable$`#ofidPending`)

Solution

  • Here is an option using non-equi join:

    DT[is.na(id_closed_date), id_closed_date := as.IDate("9999-12-31")]
    d <- seq(as.IDate("2021-03-01"), as.IDate("2021-04-11"), by="1 day")
    DT[.(d=d), on=.(id_create_date<=d, id_closed_date>=d), by=.EACHI,
        {
            nc <- sum(x.id_closed_date==i.d)
            .(num_open=sum(x.id_create_date==i.d), 
            num_closed=nc, 
            num_active=.N - nc)
        }]
    

    output:

        id_create_date id_closed_date num_open num_closed num_active
     1:     2021-03-01     2021-03-01        1          1          0
     2:     2021-03-02     2021-03-02        1          0          1
     3:     2021-03-03     2021-03-03        0          0          1
     4:     2021-03-04     2021-03-04        1          0          2
     5:     2021-03-05     2021-03-05        1          0          3
     6:     2021-03-06     2021-03-06        0          0          3
     7:     2021-03-07     2021-03-07        0          0          3
     8:     2021-03-08     2021-03-08        0          0          3
     9:     2021-03-09     2021-03-09        0          0          3
    10:     2021-03-10     2021-03-10        0          0          3
    11:     2021-03-11     2021-03-11        0          0          3
    12:     2021-03-12     2021-03-12        0          0          3
    13:     2021-03-13     2021-03-13        0          0          3
    14:     2021-03-14     2021-03-14        0          0          3
    15:     2021-03-15     2021-03-15        0          0          3
    16:     2021-03-16     2021-03-16        0          0          3
    17:     2021-03-17     2021-03-17        0          0          3
    18:     2021-03-18     2021-03-18        0          0          3
    19:     2021-03-19     2021-03-19        0          0          3
    20:     2021-03-20     2021-03-20        0          0          3
    21:     2021-03-21     2021-03-21        0          0          3
    22:     2021-03-22     2021-03-22        0          1          2
    23:     2021-03-23     2021-03-23        0          0          2
    24:     2021-03-24     2021-03-24        0          0          2
    25:     2021-03-25     2021-03-25        0          0          2
    26:     2021-03-26     2021-03-26        0          0          2
    27:     2021-03-27     2021-03-27        0          0          2
    28:     2021-03-28     2021-03-28        0          0          2
    29:     2021-03-29     2021-03-29        0          0          2
    30:     2021-03-30     2021-03-30        0          0          2
    31:     2021-03-31     2021-03-31        0          0          2
    32:     2021-04-01     2021-04-01        0          0          2
    33:     2021-04-02     2021-04-02        0          0          2
    34:     2021-04-03     2021-04-03        0          0          2
    35:     2021-04-04     2021-04-04        0          0          2
    36:     2021-04-05     2021-04-05        0          0          2
    37:     2021-04-06     2021-04-06        0          0          2
    38:     2021-04-07     2021-04-07        0          0          2
    39:     2021-04-08     2021-04-08        0          0          2
    40:     2021-04-09     2021-04-09        0          0          2
    41:     2021-04-10     2021-04-10        0          0          2
    42:     2021-04-11     2021-04-11        0          1          1
        id_create_date id_closed_date num_open num_closed num_active
    

    data:

    library(data.table)
    DT <- fread("ID id_create_date  id_closed_date
    1   2021-03-01  2021-03-01
    2   2021-03-02  NA
    3   2021-03-04  2021-04-11
    4   2021-03-05  2021-03-22")
    cols <- c("id_create_date", "id_closed_date")
    DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]