I've got two data.frames, one with event data and one with stock data of several companies (here it's only two). I want two additional columns with lagged dates (-1 day and +1 day) for both companies in my event data.frame. The lagged dates should come from my stock data.frame (df) of course. How can i do that?
DATE <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000","01.01.2000","02.01.2000","04.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df <- data.frame(DATE, RET, COMP)
df
# DATE RET COMP
# 1 01.01.2000 -2.00 A
# 2 02.01.2000 1.10 A
# 3 03.01.2000 3.00 A
# 4 06.01.2000 1.40 A
# 5 07.01.2000 -0.20 A
# 6 09.01.2000 0.60 A
# 7 10.01.2000 0.10 A
# 8 01.01.2000 -0.21 B
# 9 02.01.2000 -1.20 B
# 10 04.01.2000 0.90 B
# 11 06.01.2000 0.30 B
# 12 07.01.2000 -0.10 B
# 13 09.01.2000 0.30 B
# 14 10.01.2000 -0.12 B
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event <- data.frame(DATE, ARTICLE, COMP)
event
# DATE ARTICLE COMP
# 1 02.01.2000 blabla11 A
# 2 03.01.2000 blabla12 A
# 3 06.01.2000 blabla13 A
# 4 09.01.2000 blabla14 A
# 5 06.01.2000 blabla21 B
# 6 07.01.2000 blabla22 B
# 7 09.01.2000 blabla23 B
the output should be my data.frame event with the two additional columns DATEm1 and DATEp1
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 02.01.2000 01.01.2000 03.01.2000 blabla11 A
# 2 03.01.2000 02.01.2000 06.01.2000 blabla12 A
# 3 06.01.2000 03.01.2000 07.01.2000 blabla13 A
# 4 09.01.2000 07.01.2000 10.01.2000 blabla14 A
# 5 06.01.2000 04.01.2000 07.01.2000 blabla21 B
# 6 07.01.2000 06.01.2000 09.01.2000 blabla22 B
# 7 09.01.2000 07.01.2000 10.01.2000 blabla23 B
I have tried the approach in the answer of G. Grothendieck, which works perfectly for this example.
The problem is, my original data.frame contains way more data than this example and the sqldf approach is rather slow and uses a lot of memory (too much for my machine). Does anyone have another solution for this?
Here's another approach using data.table
:
First, we convert df
and event
to data.table
s. Here I'll use as.data.table(.)
. But you can use setDT
if you don't want to make a copy. That is, by doing setDT(df)
, df
will be modified by reference to a data.table
.
require(data.table) ## >= 1.9.2
dt1 = as.data.table(df)
dt2 = as.data.table(event)
Then we'll prepare the data as follows:
key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)
The setcolorder
rearranges the columns of your data.tables by reference. setkeyv
sorts the data.table by the given columns in ascending order, and marks the key columns for dt1
.
The column reordering is essential here because, we don't set key on dt2
(because that'll sort dt2
which may be undesirable for you). And since no key is set of dt2
, data.table takes the first 'n' (=2 here) columns from dt2
to match with the key columns from dt1
.
Note: A join
x[i]
using data.table absolutely requires key ofx
to be set. Herex = dt1
. Setting key oni
is optional, depending on if you wish the order to be preserved or not.
Now, we perform two joins and get the corresponding matching indices:
idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
The first join gets for each match of dt2
in dt1
, the first matching position in dt1
. Similarly, the second join gets for each match of dt2
in dt1
, the last matching position in dt1
. We add -1 and +1 to get the previous and next indices respectively.
Take care of one special case:
idx1[idx1 == 0L] = NA
When the matching index is 1, subtracting it will result in 0. Because of R's behaviour on 0-index, we've to explicitly replace it with NA here.
Now, we can just subset those dates and add it to dt2
by reference as follows:
dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
DATEp1 = dt1$DATE[idx2]
)]
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000