Search code examples
rdatedplyrfilterdata.table

Using dplyr or data.table to use dates from one dataframe to filter a separate dataframe


I have one dataframe containing all values (data) collected over time from different users (df). I have a separate dataframe (df1) indicating the start/end date of different events. I would like to use the second dataframe to filter the values in the first (i.e., just keep dates between certain events). My question is identical to this question - Take dates from one dataframe and filter data in another dataframe. However, the two answers there don't work for me. R can't allocate enough memory size to the left_join solution (I have a very large dataframe). The data.table answer almost works, but rather than pasting all dates in the output it only pastes the start date (but seems to work besides this - see below). Any fixes on either solution would be much appreciated. I've kept the same example code as in the initial question for consistency....

user=c(rep('A',7),rep('B',8))
data = seq(1:15)
date = as.Date(c('2016-01-01','2016-01-02','2016-01-03','2016-01-04','2016-01-05','2016-01-06','2016-01-07','2016-01-08','2016-01-09','2016-01-10','2016-01-11','2016-01-12','2016-01-13','2016-01-14','2016-01-15'))
df = data.frame(user,date,data)

df1 =data.frame(user = c('A','B'), start_date = as.Date(c('2016-01-02','2016-01-10')),  end_date = as.Date(c('2016-01-06','2016-01-14')))

Dplyr solution (doesn't work on my data due to memory size):

    library(dplyr)
    df<-left_join(df,df1,by="user")
    df <- df %>% filter(date>=start_date & date<=end_date)

Data.table solution (almost works, but just pastes the start date and not all original dates):

require(data.table)
setDT(df)[df1, .(user,date,data), on=.(user, date>=start_date, date<=end_date)]

Solution

  • This looks to be a non-equi/range join.

    dplyr

    library(dplyr)
    inner_join(df, df1, join_by(user, between(date, start_date, end_date)))
    #    user       date data start_date   end_date
    # 1     A 2016-01-02    2 2016-01-02 2016-01-06
    # 2     A 2016-01-03    3 2016-01-02 2016-01-06
    # 3     A 2016-01-04    4 2016-01-02 2016-01-06
    # 4     A 2016-01-05    5 2016-01-02 2016-01-06
    # 5     A 2016-01-06    6 2016-01-02 2016-01-06
    # 6     B 2016-01-10   10 2016-01-10 2016-01-14
    # 7     B 2016-01-11   11 2016-01-10 2016-01-14
    # 8     B 2016-01-12   12 2016-01-10 2016-01-14
    # 9     B 2016-01-13   13 2016-01-10 2016-01-14
    # 10    B 2016-01-14   14 2016-01-10 2016-01-14
    

    data.table

    library(data.table)
    DT <- as.data.table(df) # setDT is canonical
    DT1 <- as.data.table(df1)
    DT[, d := date
      ][DT1, on = .(user, d >= start_date, d <= end_date)
      ][, c("d", "d.1") := NULL][]
    #       user       date  data
    #     <char>     <Date> <int>
    #  1:      A 2016-01-02     2
    #  2:      A 2016-01-03     3
    #  3:      A 2016-01-04     4
    #  4:      A 2016-01-05     5
    #  5:      A 2016-01-06     6
    #  6:      B 2016-01-10    10
    #  7:      B 2016-01-11    11
    #  8:      B 2016-01-12    12
    #  9:      B 2016-01-13    13
    # 10:      B 2016-01-14    14
    

    We need to take a little more care here since the default is to rename/overwrite some of the join values; frustrating perhaps, but "this is the way".

    Demo without the extra steps:

    DT[DT1, on = .(user, date >= start_date, date <= end_date)]
    #       user       date  data     date.1
    #     <char>     <Date> <int>     <Date>
    #  1:      A 2016-01-02     2 2016-01-06
    #  2:      A 2016-01-02     3 2016-01-06
    #  3:      A 2016-01-02     4 2016-01-06
    #  4:      A 2016-01-02     5 2016-01-06
    #  5:      A 2016-01-02     6 2016-01-06
    #  6:      B 2016-01-10    10 2016-01-14
    #  7:      B 2016-01-10    11 2016-01-14
    #  8:      B 2016-01-10    12 2016-01-14
    #  9:      B 2016-01-10    13 2016-01-14
    # 10:      B 2016-01-10    14 2016-01-14
    

    As you can see, we have lost the original DT$date values, and the values from start_date and end_date have been renamed to d and d.1 (the column name I joined on).