Search code examples
rdata.tablegroupingdiscountrowwise

Discounting values based on previous appearance in r


I have the following (sample) dataset:

df <- data.table(
  firm = rep(c("A","B"),each=20),
  year = c(1994,1994,1994,1994,1994,1994,1994,1994,1994,1994,2000,2000,2000,2002,2002,2002,2003,2003,2003,2003,1975,1975,1975,1975,1975,1975,1976,1976,1977,1977,1977,1977,1977,1977,1977,1977,1977,1978,1978,1978),
  patent_number=c(5505081,5505081,5606110,5606110,5837890,5837890,5837890,5837890,5837890,5837890,6725912,6725912,6725912,6748800,6748800,6748800,7153136,6997049,6997049,6997049,4026555,4026555,4026555,4026555,4026555,4026555,4155095,4155095,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4137556,4253157,4253157,4253157),
  class=c("73","73","73","73","73","73","73","73","73","73","165","165","165","73","73","73","434","73","73","73","463","463","463","463","463","463","348","348","361","361","361","361","361","361","361","361","361","707","707","707"),
  sub_class=c("147","12.07","12.08","147","116.03","184","185","186","198","210","144",
                             "147","140","147","E29.272","E29.309","59","147","E29.272","E29.309","3","168",
                             "473","960","552","31","701","593","91.2","72","58","59","111",
                             "222","93.05","117","709","104.1","93.25","552"),
  sub_new_I = c(0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1)
  )
> df
   firm year patent_number class sub_class sub_new_I
 1:    A 1994       5505081    73       147         0
 2:    A 1994       5505081    73     12.07         0
 3:    A 1994       5606110    73     12.08         0
 4:    A 1994       5606110    73       147         0
 5:    A 1994       5837890    73    116.03         0
 6:    A 1994       5837890    73       184         0
 7:    A 1994       5837890    73       185         0
 8:    A 1994       5837890    73       186         0
 9:    A 1994       5837890    73       198         0
10:    A 1994       5837890    73       210         0
11:    A 2000       6725912   165       144         0
12:    A 2000       6725912   165       147         1
13:    A 2000       6725912   165       140         0
14:    A 2002       6748800    73       147         1
15:    A 2002       6748800    73   E29.272         0
16:    A 2002       6748800    73   E29.309         0
17:    A 2003       7153136   434        59         0
18:    A 2003       6997049    73       147         1
19:    A 2003       6997049    73   E29.272         1
20:    A 2003       6997049    73   E29.309         1
21:    B 1975       4026555   463         3         0
22:    B 1975       4026555   463       168         0
23:    B 1975       4026555   463       473         0
24:    B 1975       4026555   463       960         0
25:    B 1975       4026555   463       552         0
26:    B 1975       4026555   463        31         0
27:    B 1976       4155095   348       701         0
28:    B 1976       4155095   348       593         0
29:    B 1977       4137556   361      91.2         0
30:    B 1977       4137556   361        72         0
31:    B 1977       4137556   361        58         0
32:    B 1977       4137556   361        59         0
33:    B 1977       4137556   361       111         0
34:    B 1977       4137556   361       222         0
35:    B 1977       4137556   361     93.05         0
36:    B 1977       4137556   361       117         0
37:    B 1977       4137556   361       709         0
38:    B 1978       4253157   707     104.1         0
39:    B 1978       4253157   707     93.25         0
40:    B 1978       4253157   707       552         1

The column sub_new_I indicates whether a firm has previously filed a patent in that specific subclass (by a binary indicator).

My goal is to apply a time discount to decrease the value of more distant history. The logic is that a subclass that appears 10 years ago is different from the one that was filed 2 years ago. To do this, I want to rely on an exponential discount with a power less than 1, explained as follows: For example, Firm A has filed a patent (with patent_number = 6725912) in 2000 which includes subclass 147. The same subclass appears in year 1994 as well (patent_number=5606110). The indicator (sub_class_I) is therefore equal to 1 in year 2000. But applying the time discount, I want the indicator to be exp(-( (2000-1994)/5 ) )=0.301. Essentially, the discount factor is given by this formula:

formula

Here, the constant for knowledge loss is equal to 5. (This means that the denominator is equal to 5 in all calculations).

The desired output is, therefore:

> df
    firm year patent_number class sub_class sub_new_I discounted_I
 1:    A 1994       5505081    73       147         0       0.0000
 2:    A 1994       5505081    73     12.07         0       0.0000
 3:    A 1994       5606110    73     12.08         0       0.0000
 4:    A 1994       5606110    73       147         0       0.0000
 5:    A 1994       5837890    73    116.03         0       0.0000
 6:    A 1994       5837890    73       184         0       0.0000
 7:    A 1994       5837890    73       185         0       0.0000
 8:    A 1994       5837890    73       186         0       0.0000
 9:    A 1994       5837890    73       198         0       0.0000
10:    A 1994       5837890    73       210         0       0.0000
11:    A 2000       6725912   165       144         0       0.0000
12:    A 2000       6725912   165       147         1       0.3012
13:    A 2000       6725912   165       140         0       0.0000
14:    A 2002       6748800    73       147         1       0.6703
15:    A 2002       6748800    73   E29.272         0       0.0000
16:    A 2002       6748800    73   E29.309         0       0.0000
17:    A 2003       7153136   434        59         0       0.0000
18:    A 2003       6997049    73       147         1       0.8187
19:    A 2003       6997049    73   E29.272         1       0.8187
20:    A 2003       6997049    73   E29.309         1       0.8187
21:    B 1975       4026555   463         3         0       0.0000
22:    B 1975       4026555   463       168         0       0.0000
23:    B 1975       4026555   463       473         0       0.0000
24:    B 1975       4026555   463       960         0       0.0000
25:    B 1975       4026555   463       552         0       0.0000
26:    B 1975       4026555   463        31         0       0.0000
27:    B 1976       4155095   348       701         0       0.0000
28:    B 1976       4155095   348       593         0       0.0000
29:    B 1977       4137556   361      91.2         0       0.0000
30:    B 1977       4137556   361        72         0       0.0000
31:    B 1977       4137556   361        58         0       0.0000
32:    B 1977       4137556   361        59         0       0.0000
33:    B 1977       4137556   361       111         0       0.0000
34:    B 1977       4137556   361       222         0       0.0000
35:    B 1977       4137556   361     93.05         0       0.0000
36:    B 1977       4137556   361       117         0       0.0000
37:    B 1977       4137556   361       709         0       0.0000
38:    B 1978       4253157   707     104.1         0       0.0000
39:    B 1978       4253157   707     93.25         0       0.0000
40:    B 1978       4253157   707       552         1       0.5488

My dataset is big, so a data.table solution greatly appreciated.

Thanks in advance for any help.


Solution

  • Create a row index column on the dataset with .I, subset the data for thos 'sub_class' where 'sub_new_I' is 1, get the unique rows by 'firm', 'year', 'subclass', then grouped by 'firm', 'sub_class', create the discounted_I column by taking the lag of negative exp of difference between the next 'year' (lead) and the 'year' divided by 5 ('newdf'). Then, do a join with the original data on 'rn'

    df[, rn := .I]
    newdf <- unique(df[df[, sub_class %in% sub_class[sub_new_I == 1]]],
         by = c('firm', 'year', 'sub_class'))[,  discounted_I := 
             shift(exp(-(shift(year, type = 'lead') - 
               year)/5)), .(firm, sub_class)]
    df[newdf, discounted_I := discounted_I, on = .(rn)]
    

    -output

       firm year patent_number class sub_class sub_new_I rn discounted_I
     1:    A 1994       5505081    73       147         0  1           NA
     2:    A 1994       5505081    73     12.07         0  2           NA
     3:    A 1994       5606110    73     12.08         0  3           NA
     4:    A 1994       5606110    73       147         0  4           NA
     5:    A 1994       5837890    73    116.03         0  5           NA
     6:    A 1994       5837890    73       184         0  6           NA
     7:    A 1994       5837890    73       185         0  7           NA
     8:    A 1994       5837890    73       186         0  8           NA
     9:    A 1994       5837890    73       198         0  9           NA
    10:    A 1994       5837890    73       210         0 10           NA
    11:    A 2000       6725912   165       144         0 11           NA
    12:    A 2000       6725912   165       147         1 12    0.3011942
    13:    A 2000       6725912   165       140         0 13           NA
    14:    A 2002       6748800    73       147         1 14    0.6703200
    15:    A 2002       6748800    73   E29.272         0 15           NA
    16:    A 2002       6748800    73   E29.309         0 16           NA
    17:    A 2003       7153136   434        59         0 17           NA
    18:    A 2003       6997049    73       147         1 18    0.8187308
    19:    A 2003       6997049    73   E29.272         1 19    0.8187308
    20:    A 2003       6997049    73   E29.309         1 20    0.8187308
    21:    B 1975       4026555   463         3         0 21           NA
    22:    B 1975       4026555   463       168         0 22           NA
    23:    B 1975       4026555   463       473         0 23           NA
    24:    B 1975       4026555   463       960         0 24           NA
    25:    B 1975       4026555   463       552         0 25           NA
    26:    B 1975       4026555   463        31         0 26           NA
    27:    B 1976       4155095   348       701         0 27           NA
    28:    B 1976       4155095   348       593         0 28           NA
    29:    B 1977       4137556   361      91.2         0 29           NA
    30:    B 1977       4137556   361        72         0 30           NA
    31:    B 1977       4137556   361        58         0 31           NA
    32:    B 1977       4137556   361        59         1 32           NA
    33:    B 1977       4137556   361       111         0 33           NA
    34:    B 1977       4137556   361       222         0 34           NA
    35:    B 1977       4137556   361     93.05         0 35           NA
    36:    B 1977       4137556   361       117         0 36           NA
    37:    B 1977       4137556   361       709         0 37           NA
    38:    B 1978       4253157   707     104.1         0 38           NA
    39:    B 1978       4253157   707     93.25         0 39           NA
    40:    B 1978       4253157   707       552         1 40    0.5488116