Search code examples
rdplyrwindowcumsum

Lagged rolling interval window in dplyr


I have a dataset like df, where year_month represents the year in the first 2 digits, and the month in the last 2 digits (1901=01/2019).

id <- c(1,1,1,1,1,2,2,2,2,2)
year_month <- c(1801,1809,1812,1901,1908, 1901,1909,1912,2011,2012 )
value <- c(10,20,30,40,50,60,70,80,90,100)

df <- data.frame(id, year_month, value)

I want to sum the accumulated value of each firm during the last year (e.g. if we are in 1901=01/2019, I want to start counting since 1801 (01/2018) and record the history of values prior to 1901, i.e. count from 1801 until 1812). The dataset will look like one_year_window

value_last_year <- c(NA,10,30,60,90,NA,60,130,80,170)

one_year_window <- data.frame(id, year_month, value, value_last_year)

I have tried several ways (cummsum()) in dplyr without success.

Any clue?


Solution

  • df %>% 
      group_by(id, year = round(year_month, -2)) %>% 
      mutate(value_last_year = cumsum(value)) %>%
      ungroup(year) %>% 
      mutate(value_last_year = ifelse(year == max(year), value, NA)) %>% 
      ungroup() %>%
      select(-year)
    

    Though it may be to your advantage to compute the cumulative sum across all years and then filter by id and year_month if you need their most recent records.

    Output

          id year_month value value_last_year
       <dbl>      <dbl> <dbl>           <dbl>
     1     1       1801    10              NA
     2     1       1809    20              NA
     3     1       1812    30              NA
     4     1       1901    40              40
     5     1       1908    50              50
     6     2       1901    60              NA
     7     2       1909    70              NA
     8     2       1912    80              NA
     9     2       2011    90              90
    10     2       2012   100             100
    

    Update

    Based on your updated post. I do not take credit for this answer. You should follow this link and upvote the answer there that this is based on:

    library(dplyr)
    library(lubridate)
    library(zoo)
    
    df %>% 
      mutate(yrmd = ymd(paste0(year_month, "01"))) %>% 
      group_by(id) %>% 
      mutate(value_last_year = rollapplyr(value, width = 1:n() - findInterval(yrmd %m-% months(13), yrmd), sum),
             value_last_year = ifelse(row_number() == 1, NA, value_last_year - value)) %>% 
      select(-yrmd) %>% 
      ungroup()
    

    Output

          id year_month value value_last_year
       <dbl>      <dbl> <dbl>           <dbl>
     1     1       1801    10              NA
     2     1       1809    20              30
     3     1       1812    30              60
     4     1       1901    40              90
     5     1       1908    50             140
     6     2       1901    60              NA
     7     2       1909    70             130
     8     2       1912    80             210
     9     2       2011    90             170
    10     2       2012   100             190