I have a database with sales value for individual firms that belong to different industries. In the example dataset below:
set.seed(123)
df <- data.table(year=rep(1980:1984,each=4),sale=sample(100:150,20),ind=sample(LETTERS[1:2],20,replace = TRUE))
df[order(year,ind)]
year sale ind
1: 1980 114 A
2: 1980 102 A
3: 1980 130 B
4: 1980 113 B
5: 1981 136 A
6: 1981 148 A
7: 1981 141 B
8: 1981 142 B
9: 1982 124 A
10: 1982 125 A
11: 1982 104 A
12: 1982 126 B
13: 1983 108 A
14: 1983 128 A
15: 1983 140 B
16: 1983 127 B
17: 1984 134 A
18: 1984 107 A
19: 1984 106 A
20: 1984 146 B
The column "ind" represents industry and I have omitted the firm identifiers (no use in this example). I want an average defined as follows:
For each year, the desired average is the average of all firms within the industry over the past three years. If the data for past three years is not available, a minimum of two observations is also acceptable.
For example, in the above dataset, if year=1982, and ind=A, there are only two observations for past years (which is still acceptable), so the desired average is the average of all sale values in years 1980 and 1981 for industry A.
If year=1983, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1980, 1981, and 1982 for industry A.
If year=1984, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1981, 1982, and 1983 for industry A.
The desired output, thus, will be as follows:
year sale ind mymean
1: 1980 130 B NA
2: 1980 114 A NA
3: 1980 113 B NA
4: 1980 102 A NA
5: 1981 141 B NA
6: 1981 142 B NA
7: 1981 136 A NA
8: 1981 148 A NA
9: 1982 124 A 125.0000
10: 1982 125 A 125.0000
11: 1982 126 B 131.5000
12: 1982 104 A 125.0000
13: 1983 140 B 130.4000
14: 1983 127 B 130.4000
15: 1983 108 A 121.8571
16: 1983 128 A 121.8571
17: 1984 134 A 124.7143
18: 1984 107 A 124.7143
19: 1984 146 B 135.2000
20: 1984 106 A 124.7143
A data.table solution is much preferred for fast implementation. Many thanks in advance.
I am not very good in data.table
. Here is one tidyverse
solution if you like or if you can translate it to data.table
library(tidyverse)
df %>% group_by(ind, year) %>%
summarise(ds = sum(sale),
dn = n()) %>%
mutate(ds = (lag(ds,1)+lag(ds,2)+ifelse(is.na(lag(ds,3)), 0, lag(ds,3)))/(lag(dn,1)+lag(dn,2)+ifelse(is.na(lag(dn,3)), 0, lag(dn,3)))
) %>% select(ind, year, mymean = ds) %>%
right_join(df, by = c("ind", "year"))
`summarise()` regrouping output by 'ind' (override with `.groups` argument)
# A tibble: 20 x 4
ind year mymean sale
<chr> <int> <dbl> <int>
1 A 1980 NA 114
2 A 1980 NA 102
3 A 1981 NA 136
4 A 1981 NA 148
5 A 1982 125 124
6 A 1982 125 125
7 A 1982 125 104
8 A 1983 122. 108
9 A 1983 122. 128
10 A 1984 125. 134
11 A 1984 125. 107
12 A 1984 125. 106
13 B 1980 NA 130
14 B 1980 NA 113
15 B 1981 NA 141
16 B 1981 NA 142
17 B 1982 132. 126
18 B 1983 130. 140
19 B 1983 130. 127
20 B 1984 135. 146