Search code examples
rmergelag

Create lagged vectors based on a different data.frame in a panel in R


I've got two data.frames, one with event data and one with stock data of several companies (here it's only two). I want two additional columns with lagged dates (-1 day and +1 day) for both companies in my event data.frame. The lagged dates should come from my stock data.frame (df) of course. How can i do that?

DATE <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000","01.01.2000","02.01.2000","04.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df <- data.frame(DATE, RET, COMP)

df

# DATE   RET COMP
# 1  01.01.2000 -2.00    A
# 2  02.01.2000  1.10    A
# 3  03.01.2000  3.00    A
# 4  06.01.2000  1.40    A
# 5  07.01.2000 -0.20    A
# 6  09.01.2000  0.60    A
# 7  10.01.2000  0.10    A
# 8  01.01.2000 -0.21    B
# 9  02.01.2000 -1.20    B
# 10 04.01.2000  0.90    B
# 11 06.01.2000  0.30    B
# 12 07.01.2000 -0.10    B
# 13 09.01.2000  0.30    B
# 14 10.01.2000 -0.12    B

DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")

event <- data.frame(DATE, ARTICLE, COMP)

event

#         DATE  ARTICLE COMP
# 1 02.01.2000 blabla11    A
# 2 03.01.2000 blabla12    A
# 3 06.01.2000 blabla13    A
# 4 09.01.2000 blabla14    A
# 5 06.01.2000 blabla21    B
# 6 07.01.2000 blabla22    B
# 7 09.01.2000 blabla23    B

the output should be my data.frame event with the two additional columns DATEm1 and DATEp1

#         DATE      DATEm1      DATEp1   ARTICLE COMP
# 1 02.01.2000  01.01.2000  03.01.2000  blabla11    A
# 2 03.01.2000  02.01.2000  06.01.2000  blabla12    A
# 3 06.01.2000  03.01.2000  07.01.2000  blabla13    A
# 4 09.01.2000  07.01.2000  10.01.2000  blabla14    A
# 5 06.01.2000  04.01.2000  07.01.2000  blabla21    B
# 6 07.01.2000  06.01.2000  09.01.2000  blabla22    B
# 7 09.01.2000  07.01.2000  10.01.2000  blabla23    B

I have tried the approach in the answer of G. Grothendieck, which works perfectly for this example.

The problem is, my original data.frame contains way more data than this example and the sqldf approach is rather slow and uses a lot of memory (too much for my machine). Does anyone have another solution for this?


Solution

  • Here's another approach using data.table:

    First, we convert df and event to data.tables. Here I'll use as.data.table(.). But you can use setDT if you don't want to make a copy. That is, by doing setDT(df), df will be modified by reference to a data.table.

    require(data.table) ## >= 1.9.2
    dt1 = as.data.table(df)
    dt2 = as.data.table(event)
    

    Then we'll prepare the data as follows:

    key_cols = c("COMP", "DATE")
    setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
    setkeyv(dt1, key_cols)
    

    The setcolorder rearranges the columns of your data.tables by reference. setkeyv sorts the data.table by the given columns in ascending order, and marks the key columns for dt1.

    The column reordering is essential here because, we don't set key on dt2 (because that'll sort dt2 which may be undesirable for you). And since no key is set of dt2, data.table takes the first 'n' (=2 here) columns from dt2 to match with the key columns from dt1.

    Note: A join x[i] using data.table absolutely requires key of x to be set. Here x = dt1. Setting key on i is optional, depending on if you wish the order to be preserved or not.

    Now, we perform two joins and get the corresponding matching indices:

    idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
    idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
    

    The first join gets for each match of dt2 in dt1, the first matching position in dt1. Similarly, the second join gets for each match of dt2 in dt1, the last matching position in dt1. We add -1 and +1 to get the previous and next indices respectively.

    Take care of one special case:

    idx1[idx1 == 0L] = NA
    

    When the matching index is 1, subtracting it will result in 0. Because of R's behaviour on 0-index, we've to explicitly replace it with NA here.

    Now, we can just subset those dates and add it to dt2 by reference as follows:

    dt2[, `:=`(DATEm1 = dt1$DATE[idx1], 
               DATEp1 = dt1$DATE[idx2]
      )]
    
    #    COMP       DATE  ARTICLE     DATEm1     DATEp1
    # 1:    A 02.01.2000 blabla11 01.01.2000 03.01.2000
    # 2:    A 03.01.2000 blabla12 02.01.2000 06.01.2000
    # 3:    A 06.01.2000 blabla13 03.01.2000 07.01.2000
    # 4:    A 09.01.2000 blabla14 07.01.2000 10.01.2000
    # 5:    B 06.01.2000 blabla21 04.01.2000 07.01.2000
    # 6:    B 07.01.2000 blabla22 06.01.2000 09.01.2000
    # 7:    B 09.01.2000 blabla23 07.01.2000 10.01.2000