Search code examples
rdplyr

Use R dplyr to calculate N averages between two dates in a time series


I am trying to calculate averages of a value in a time series between many start and end dates.

A simple set up:

time_series <- tibble(dates=seq.Date(from=ymd('20240101'), to=ymd('20241231'), 'day'),
                      value = runif(366,max=10))
                      

target_dates <-tibble(id=c('A','B','C'),
                      start=ymd(c('20240104','20240504','20240704')),
                      end  =ymd(c('20240201','20240524','20240924'))) %>% 
               mutate(ndays=as.integer(end-start))

I'd like to calculate the mean of time_series$value between each target_dates$start and target_dates$end.

I'm unable to figure how to do this from two separate tibbles.

My first thought was this (and variations on this attempt).

target_dates %>% 
  group_by(start) %>% 
  summarise(avg = mean(time_series %>% slice(first(which(dates==start)):slice(last(which(dates==end)))) %>% select(values)))

But, that didn't work.

This also came to mind but does not work b/c because the between function does not work on vectors. I could write a for loop, but that's not elegant.

time_series %>% 
  summarise(avg = mean(value[between(target_dates$start,target_dates$end)]))

The column ndays is calculated because I tried using slide_idx_double from the slider library but could not figure out how to add the time series data to a statement that started with

time_series %>%
    summarise(avg = slide_index_dbl(WHAT GOES HERE??,start,.after=ndays,mean))

What I'm expecting is a tibble/data frame to be created in which the average of the time series time_series tibble value column is calculated between the target_dates start and end dates.

Help is appreciated.


Solution

  • See if joining works with your data

    library(dplyr)
    
    right_join(time_series, target_dates, join_by(between(dates, start, end))) %>% 
      summarize(Mean = mean(value), .by = c(id, start, end))
    # A tibble: 3 × 4
      id    start      end         Mean
      <chr> <date>     <date>     <dbl>
    1 A     2024-01-04 2024-02-01  5.23
    2 B     2024-05-04 2024-05-24  5.09
    3 C     2024-07-04 2024-09-24  4.96
    

    Note, using set.seed to get a consistent data set

    Data

    set.seed(1)
    
    time_series <- tibble(dates=seq.Date(from=ymd('20240101'), 
                     to=ymd('20241231'), 'day'),
                          value = runif(366,max=10))
                          
    
    target_dates <-tibble(id=c('A','B','C'),
                          start=ymd(c('20240104','20240504','20240704')),
                          end  =ymd(c('20240201','20240524','20240924'))) %>% 
                   mutate(ndays=as.integer(end-start))