Search code examples
rdateaverage

Given different start and end dates, find the daily average of each variable


I have data with varying start and end dates.

mydata <- data.frame(id=c(1,2,3), start=c("2010/01/01","2010/01/01","2010/01/02"), end=c("2010/01/01","2010/01/05","2010/01/03"), a=c(140,750,56),b=c(48,25,36))
mydata

  id      start        end   a  b
1  1 2010-01-01 2010-01-01 140 48
2  2 2010-01-01 2010-01-05 750 25
3  3 2010-01-02 2010-01-03  56 36

I want to find the average of the variables a and b for each day. Below I execute it by expanding every row with different start and end dates, then collapsing it back to the daily level.

mydata$subt <- as.numeric(as.Date(mydata$end, "%Y/%m/%d") - as.Date(mydata$start, "%Y/%m/%d") + 1)

require(data.table) 
mydata <- setDT(mydata)[ , list(idnum = id, date = seq(start, end, by = "day"), a=a/subt, b=b/subt), by = 1:nrow(mydata)]
mydata

   nrow idnum       date   a  b
1:    1     1 2010-01-01 140 48
2:    2     2 2010-01-01 150  5
3:    2     2 2010-01-02 150  5
4:    2     2 2010-01-03 150  5
5:    2     2 2010-01-04 150  5
6:    2     2 2010-01-05 150  5
7:    3     3 2010-01-02  28 18
8:    3     3 2010-01-03  28 18

mydata %>%
  group_by(date) %>%
  summarize(a = sum(a),
            b = sum(b))

Desired Outcome:

  date           a     b
  <date>     <dbl> <dbl>
1 2010-01-01   290    53
2 2010-01-02   178    23
3 2010-01-03   178    23
4 2010-01-04   150     5
5 2010-01-05   150     5

However, I have plenty of rows with different start and end dates, and sometimes the length of difference is very long. I am wondering if there is an easier way (i.e., without expanding every row) to find the daily averages for each variable. It would also be great if there is a way to find the weekly averages without first finding the daily figures. Thank you!


Solution

  • Here is an option with tidyverse. We convert the 'start' 'end' columns to Date class with ymd (from lubridate), create a sequence of dates from 'start' to 'end' for corresponding elements with map2, mutate the 'a', 'b' by dividing them with the lengths of the list column 'date', unnest the 'date' and grouped by 'date' we get the sum of 'a', 'b'

    library(dplyr)
    library(tidyr)
    library(lubridate)
    library(purrr)
    mydata %>%
         mutate(across(c(start, end), ymd)) %>% 
         transmute(id, date = map2(start, end, seq, by = 'day'), a, b) %>% 
         mutate(across(c(a, b), ~ ./lengths(date)))  %>%
         unnest(date) %>%
         group_by(date) %>% 
         summarise(across(c(a, b), sum, na.rm = TRUE))
    # A tibble: 5 x 3
    #  date           a     b
    #  <date>     <dbl> <dbl>
    #1 2010-01-01   290    53
    #2 2010-01-02   178    23
    #3 2010-01-03   178    23
    #4 2010-01-04   150     5
    #5 2010-01-05   150     5