Search code examples
rdatemissing-datarolling-computation

How to aggregate a variable by date


I have an R dataset with columns:

  • date
  • country_code
  • version
  • money

The variable money is numerical. The variables version and country_code are categorical. Each combination of date, country_code and version defines a unique observation for which the value of money is meassured.

I want to group the variables version and country_code and compute 3 new variables in the dataset which are:

  • money_week: For each observation, this variable stores the aggregation of the money along the previous 7 days.
  • money_15: For each observation, this variable stores the aggregation of the money along the previous 15 days
  • money_total: For each observation, this variable stores the aggregation of the money up to the current date.

The different combinations of country_code and version may have different start and end dates, and there also may missing dates.

Example

data = tibble(
date = seq(as.Date('2022-01-01'), as.Date('2022-01-13'), by=1),
       money = 1:13)
data = data[-c(2,3,4,10,11),]

# [Some computations for adding the expected new variables]

# A tibble: 10 × 2
  date       money    money_week
2022-01-01     1          1
2022-01-05     5          6
2022-01-06     6          12
2022-01-07     7          19
2022-01-08     8          26
2022-01-09     9          35        
2022-01-12    12          42
2022-01-13    13          49

Observe that each observation is not simply the sum of the previous 7 observations, but the sum of the previous 7 days. So for example, the observation on date 2022-01-08 is the sum of observation on dates 2022-01-08, 2022-01-07, 2022-01-06, 2022-01-05, and it would include observations 2022-01-04, 2022-01-03, 2022-01-02 but those last are missing.


Solution

  • You can use your date variable as an index for a sliding window function.

    library(slider)
    library(tidyverse)
    
    data %>%
      mutate(money_week = slide_index_dbl(money, date, sum, .before = days(6)))
    #> # A tibble: 8 × 3
    #>   date       money money_week
    #>   <date>     <int>      <dbl>
    #> 1 2022-01-01     1          1
    #> 2 2022-01-05     5          6
    #> 3 2022-01-06     6         12
    #> 4 2022-01-07     7         19
    #> 5 2022-01-08     8         26
    #> 6 2022-01-09     9         35
    #> 7 2022-01-12    12         42
    #> 8 2022-01-13    13         49
    

    More test data to show money_15

    set.seed(123)
    newdata = tibble(
      date = seq(as.Date('2022-01-01'), as.Date('2022-03-31'), by=1),
      money = sample(1:20,90, replace = TRUE)) %>%
      slice_sample(n = 60) %>%
      arrange(date)
    
    
    newdata %>%
      mutate(money_week = slide_index_dbl(money, date, sum, .before = days(6)),
             money_15 = slide_index_dbl(money, date, sum, .before = days(14)),
             money_total = cumsum(money))
    #> # A tibble: 60 × 5
    #>    date       money money_week money_15 money_total
    #>    <date>     <int>      <dbl>    <dbl>       <int>
    #>  1 2022-01-01    15         15       15          15
    #>  2 2022-01-02    19         34       34          34
    #>  3 2022-01-03    14         48       48          48
    #>  4 2022-01-05    10         58       58          58
    #>  5 2022-01-06    18         76       76          76
    #>  6 2022-01-07    11         87       87          87
    #>  7 2022-01-08     5         77       92          92
    #>  8 2022-01-09    20         78      112         112
    #>  9 2022-01-11     5         69      117         117
    #> 10 2022-01-13     9         50      126         126
    #> # ℹ 50 more rows