Search code examples
rdata.tablerolling-computationrolling-average

Rolling avergae for aggregated results in R


I have a database with sales value for individual firms that belong to different industries. In the example dataset below:

set.seed(123)
df <- data.table(year=rep(1980:1984,each=4),sale=sample(100:150,20),ind=sample(LETTERS[1:2],20,replace = TRUE))
df[order(year,ind)]
    year sale ind
 1: 1980  114   A
 2: 1980  102   A
 3: 1980  130   B
 4: 1980  113   B
 5: 1981  136   A
 6: 1981  148   A
 7: 1981  141   B
 8: 1981  142   B
 9: 1982  124   A
10: 1982  125   A
11: 1982  104   A
12: 1982  126   B
13: 1983  108   A
14: 1983  128   A
15: 1983  140   B
16: 1983  127   B
17: 1984  134   A
18: 1984  107   A
19: 1984  106   A
20: 1984  146   B

The column "ind" represents industry and I have omitted the firm identifiers (no use in this example). I want an average defined as follows:

For each year, the desired average is the average of all firms within the industry over the past three years. If the data for past three years is not available, a minimum of two observations is also acceptable.

For example, in the above dataset, if year=1982, and ind=A, there are only two observations for past years (which is still acceptable), so the desired average is the average of all sale values in years 1980 and 1981 for industry A.

If year=1983, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1980, 1981, and 1982 for industry A.

If year=1984, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1981, 1982, and 1983 for industry A.

The desired output, thus, will be as follows:

    year sale ind   mymean
 1: 1980  130   B       NA
 2: 1980  114   A       NA
 3: 1980  113   B       NA
 4: 1980  102   A       NA
 5: 1981  141   B       NA
 6: 1981  142   B       NA
 7: 1981  136   A       NA
 8: 1981  148   A       NA
 9: 1982  124   A 125.0000
10: 1982  125   A 125.0000
11: 1982  126   B 131.5000
12: 1982  104   A 125.0000
13: 1983  140   B 130.4000
14: 1983  127   B 130.4000
15: 1983  108   A 121.8571
16: 1983  128   A 121.8571
17: 1984  134   A 124.7143
18: 1984  107   A 124.7143
19: 1984  146   B 135.2000
20: 1984  106   A 124.7143

A data.table solution is much preferred for fast implementation. Many thanks in advance.


Solution

  • I am not very good in data.table. Here is one tidyverse solution if you like or if you can translate it to data.table

    library(tidyverse)
    
    df %>% group_by(ind, year) %>% 
      summarise(ds = sum(sale),
                dn = n()) %>%
      mutate(ds = (lag(ds,1)+lag(ds,2)+ifelse(is.na(lag(ds,3)), 0, lag(ds,3)))/(lag(dn,1)+lag(dn,2)+ifelse(is.na(lag(dn,3)), 0, lag(dn,3)))
      ) %>% select(ind, year, mymean = ds) %>%
      right_join(df, by = c("ind", "year"))
    
    `summarise()` regrouping output by 'ind' (override with `.groups` argument)
    # A tibble: 20 x 4
       ind    year mymean  sale
       <chr> <int>  <dbl> <int>
     1 A      1980    NA    114
     2 A      1980    NA    102
     3 A      1981    NA    136
     4 A      1981    NA    148
     5 A      1982   125    124
     6 A      1982   125    125
     7 A      1982   125    104
     8 A      1983   122.   108
     9 A      1983   122.   128
    10 A      1984   125.   134
    11 A      1984   125.   107
    12 A      1984   125.   106
    13 B      1980    NA    130
    14 B      1980    NA    113
    15 B      1981    NA    141
    16 B      1981    NA    142
    17 B      1982   132.   126
    18 B      1983   130.   140
    19 B      1983   130.   127
    20 B      1984   135.   146