I have data with varying start and end dates.
mydata <- data.frame(id=c(1,2,3), start=c("2010/01/01","2010/01/01","2010/01/02"), end=c("2010/01/01","2010/01/05","2010/01/03"), a=c(140,750,56),b=c(48,25,36))
mydata
id start end a b
1 1 2010-01-01 2010-01-01 140 48
2 2 2010-01-01 2010-01-05 750 25
3 3 2010-01-02 2010-01-03 56 36
I want to find the average of the variables a and b for each day. Below I execute it by expanding every row with different start and end dates, then collapsing it back to the daily level.
mydata$subt <- as.numeric(as.Date(mydata$end, "%Y/%m/%d") - as.Date(mydata$start, "%Y/%m/%d") + 1)
require(data.table)
mydata <- setDT(mydata)[ , list(idnum = id, date = seq(start, end, by = "day"), a=a/subt, b=b/subt), by = 1:nrow(mydata)]
mydata
nrow idnum date a b
1: 1 1 2010-01-01 140 48
2: 2 2 2010-01-01 150 5
3: 2 2 2010-01-02 150 5
4: 2 2 2010-01-03 150 5
5: 2 2 2010-01-04 150 5
6: 2 2 2010-01-05 150 5
7: 3 3 2010-01-02 28 18
8: 3 3 2010-01-03 28 18
mydata %>%
group_by(date) %>%
summarize(a = sum(a),
b = sum(b))
Desired Outcome:
date a b
<date> <dbl> <dbl>
1 2010-01-01 290 53
2 2010-01-02 178 23
3 2010-01-03 178 23
4 2010-01-04 150 5
5 2010-01-05 150 5
However, I have plenty of rows with different start and end dates, and sometimes the length of difference is very long. I am wondering if there is an easier way (i.e., without expanding every row) to find the daily averages for each variable. It would also be great if there is a way to find the weekly averages without first finding the daily figures. Thank you!
Here is an option with tidyverse
. We convert the 'start' 'end' columns to Date
class with ymd
(from lubridate
), create a seq
uence of dates from 'start' to 'end' for corresponding elements with map2
, mutate
the 'a', 'b' by dividing them with the lengths
of the list
column 'date', unnest
the 'date' and grouped by 'date' we get the sum
of 'a', 'b'
library(dplyr)
library(tidyr)
library(lubridate)
library(purrr)
mydata %>%
mutate(across(c(start, end), ymd)) %>%
transmute(id, date = map2(start, end, seq, by = 'day'), a, b) %>%
mutate(across(c(a, b), ~ ./lengths(date))) %>%
unnest(date) %>%
group_by(date) %>%
summarise(across(c(a, b), sum, na.rm = TRUE))
# A tibble: 5 x 3
# date a b
# <date> <dbl> <dbl>
#1 2010-01-01 290 53
#2 2010-01-02 178 23
#3 2010-01-03 178 23
#4 2010-01-04 150 5
#5 2010-01-05 150 5