Search code examples
rgroup-bydata.tablerowrowwise

Comparing the values of a certain number previous rows with the current row


In a database containing firm and patent class values, I want to calculate the following variables:

Technological abandonment: Number of previously active technological patent classes abandoned annually.

Specifically, I want to create variables that calculate the number of patent classes (variable = class) that the firm has used in the past 3 years (t-3, t-2, and t-1) (min observation of one year prior is acceptable if the firm history initially doesn't have 3 years) but are missing in this year (t) I would like to do the same with a 5 year window as well.

I have a dataset containing millions of rows, so a fast data.table solution is much preferred.

In the following dataset:

df <- data.table(year=c(1979,1979,1980,1980,1981,1981,1982,1983,1983,1984,1984),
                 category = c("A","A","B","C","A","D","F","F","C","A","B"))

The desired outcome would be (for a three year window):

    year        class tech_aband_3
 1: 1979        A     0
 2: 1979        A     0
 3: 1980        B     1
 4: 1980        C     1
 5: 1981        A     2
 6: 1981        D     2
 7: 1982        F     4
 8: 1983        F     3
 9: 1983        C     3
10: 1984        A     3
11: 1984        B     3

I will have to run the solution by firm id in the end.

Many thanks in advance.


Solution

  • Assuming that all years are represented in the data (if not, you'd need to fill missing years for the following to work), you can try:

    library(data.table)  
      
    df[, .(category = list(unique(category))), by = year
       ][, tech_aband_3 := lengths(mapply(\(x, y) setdiff(unlist(x), y), 
                                          transpose(shift(list(category), 1:3, fill = first(category[[1]]))), 
                                          category))
         ][, .(category = unlist(category)), by = .(year, tech_aband_3)
           ][ df, on = .(year, category)
           ]
    
         year tech_aband_3 category
        <num>        <int>   <char>
     1:  1979            0        A
     2:  1979            0        A
     3:  1980            1        B
     4:  1980            1        C
     5:  1981            2        A
     6:  1981            2        D
     7:  1982            4        F
     8:  1983            3        F
     9:  1983            3        C
    10:  1984            3        A
    11:  1984            3        B