Search code examples
rdatetimedplyrfilteringlubridate

Filter between multiple date ranges


I have a large dataframe with four measurments per second over two weeks. Hence the dataframe is quite large.
I also have two vectors with start and end dates, which define certain time ranges within the dataframe and which I have to filter out.
What I want to do is to extract the data between the start and end dates.
My data looks something like this.

library(lubridate)
library(dplyr)

df <- data.frame(datetime = seq(ymd_hms("2020/01/01 00:00:00"),
                                by = "sec",
                                length.out = 3600),
                 var = rnorm(3600))

And my start/end vector look like this. Here I only added two start/end combinations. But the actual vector contain more values.

start = c(ymd_hms("2020/01/01 00:1:00"), ymd_hms("2020/01/01 00:30:00"))
end = c(ymd_hms("2020/01/01 00:1:04"), ymd_hms("2020/01/01 00:30:04"))

I tried to just filter it, using

filtered <- df %>%
  filter(datetime >= start & datetime <= end)

head(filtered)

             datetime        var
1 2020-01-01 00:01:00 -0.2245330
2 2020-01-01 00:01:02  0.5926424
3 2020-01-01 00:01:04 -0.3824533
4 2020-01-01 00:30:01 -0.7202059
5 2020-01-01 00:30:03 -0.5775794

But it seems to downsample the data, as the filtered dataframe has only three measurments for the first time interval, instead of the expected five values.

If I filter only for the first start and end date, I get the five values.

filtered2 <- df %>%
  filter(datetime >= start[1] & datetime <= end[1])

head(filtered2)

             datetime         var
1 2020-01-01 00:01:00 -0.22453305
2 2020-01-01 00:01:01  1.13452854
3 2020-01-01 00:01:02  0.59264239
4 2020-01-01 00:01:03 -0.03700048
5 2020-01-01 00:01:04 -0.38245332

Where I get stuck is:
Why does the first filtering work, but does not return the expected full daterange?
And how can I get the complete data filtered?

I alreadz tried filter(between(datetime, start, end). This gave me the expected result, but only for the first daterange. It seeems that dplyr::between does not accept vecorts.

Any help would be very welcome.

UPDATE
@ekoam has pointed out correctly that data.table::between works as well. But as dplyr::between, it doesn't like vectors.


Solution

  • With some inspiration from this question on how to Efficient way to filter one data frame by ranges in another, I came up with the following solutions.

    One is a very slow with very large datasets:
    It takes my data provided above and uses rowwise()

    filtered3 <- df %>% 
      rowwise() %>%
      filter(any(datetime >= start & datetime <= end))
    
    

    As I mentioned, with more than 3 million rows in my data, this was very slow.

    Another option, also from the answer linked above, includes using the data.table package, which has an inrange function. This one works much faster.

    library(data.table)
    range <- data.table(start = start, end = end)
    filtered4 <- setDT(df)[datetime %inrange% range]