In a database containing firm and patent class values, I want to calculate the following variables:
Technological abandonment: Number of previously active technological patent classes abandoned annually.
Specifically, I want to create variables that calculate the number of patent classes (variable = class) that the firm has used in the past 3 years (t-3, t-2, and t-1) (min observation of one year prior is acceptable if the firm history initially doesn't have 3 years) but are missing in this year (t) I would like to do the same with a 5 year window as well.
I have a dataset containing millions of rows, so a fast data.table solution is much preferred.
In the following dataset:
df <- data.table(year=c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
category = c("A","A","B","C","A","D","F","F","C","A","B"))
The desired outcome would be (for a three year window):
year class tech_aband_3
1: 1979 A 0
2: 1979 A 0
3: 1980 B 1
4: 1980 C 1
5: 1981 A 2
6: 1981 D 2
7: 1982 F 4
8: 1983 F 3
9: 1983 C 3
10: 1984 A 3
11: 1984 B 3
I will have to run the solution by firm id in the end.
Many thanks in advance.
Assuming that all years are represented in the data (if not, you'd need to fill missing years for the following to work), you can try:
library(data.table)
df[, .(category = list(unique(category))), by = year
][, tech_aband_3 := lengths(mapply(\(x, y) setdiff(unlist(x), y),
transpose(shift(list(category), 1:3, fill = first(category[[1]]))),
category))
][, .(category = unlist(category)), by = .(year, tech_aband_3)
][ df, on = .(year, category)
]
year tech_aband_3 category
<num> <int> <char>
1: 1979 0 A
2: 1979 0 A
3: 1980 1 B
4: 1980 1 C
5: 1981 2 A
6: 1981 2 D
7: 1982 4 F
8: 1983 3 F
9: 1983 3 C
10: 1984 3 A
11: 1984 3 B