I am trying to find overlapping periods for a data frame with two identifiers, which differs from the other questions which have been asked on this topic where overlapping observations had to be counted for only one identifier.
In my data frame, every transaction has a unique ID and every firm which is transacted in has a unique firm identifier. I am trying to find overlapping periods per firm identifier, counted per transaction ID.
I used the code from @Waldi to create this reproducible example, found in this topic (I only added a firm identifier or "FirmID" to the sample data frame):
Code to create a similar dataset:
library (data.table)
size = 1e5
df <- data.frame(
ID = sample(1:round(size / 5, 0)),
FirmID = sample(1:20000),
period = sample(c(5,10,30,45), size, replace = TRUE),
start = sample(seq(
as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
), size, replace = TRUE)
) %>% mutate(end = start + period)
The code to find overlapping periods:
dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]
setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
,.(noverlap=.N),by=.(FirmID,row)][
,.(overlap = max(noverlap>1)),by=FirmID][
,.(n=.N),by=.(overlap)][
,pct:=n/sum(n)][]
The only problem with this code, is that it displays the number of overlapping firms, instead of the number of overlapping transactions within a firm identifier.
overlap n pct
1: 0 5333 0.26665
2: 1 14667 0.73335
How can this code be altered for overlapping transaction IDs within a firm identifier? One change I made myself did not yield satisfactory results:
setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
,.(noverlap=.N),by=.(ID,row)][
,.(overlap = max(noverlap>1)),by=ID][
,.(n=.N),by=.(overlap)][
,pct:=n/sum(n)][]
overlap n pct
1: 0 5333 0.26665
2: 1 14667 0.73335
I had a hard time wrapping my head around the solution for the other scenario, so I'm not sure how to adjust that version, but I believe you can achieve the result you want in this case with an inner non-equi join:
dt <- data.table(df, key = c("FirmID", "ID", "start", "end"))
dt[, firm_total := .N, by = "FirmID"][
dt, .(FirmID, firm_total), on = .(FirmID, ID < ID, start <= end, end >= start), nomatch = NULL, mult = "all"][
, .(n = .N, pct = .N / (firm_total[1L] * (firm_total[1L] - 1L) / 2L)), by = "FirmID"]
We avoid redundant rows in the join with ID < ID
(and note that here that should be interpreted as left-hand side column's ID < right-hand side column's ID).
Redundancy can happen because,
if ID x
overlaps with ID y
,
y
overlaps with x
.
If you think of all pairs of IDs and put them in a matrix,
the maximum number of overlaps would be the number of elements in the lower triangular,
which can be calculated with n * (n - 1) / 2
,
that's why we initially add firm_total
.
I didn't do extensive testing but this version may be better for this scenario.
The documentation of foverlaps
states that it's mainly targeted at joins where one table is much smaller than the other one,
and explains why it may be an expensive operation.
You're doing a self-join,
so both tables are the same size.
And a table.express
version of the solution because why not:
library(table.express)
dt %>%
group_by(FirmID) %>%
mutate(firm_total = .N) %>%
inner_join(dt, FirmID, ID < ID, start <= end, end >= start, .expr = TRUE) %>%
select(FirmID, firm_total) %>%
group_by(FirmID) %>%
summarize(n = .N, pct = .N / (firm_total[1L] * (firm_total[1L] - 1L) / 2))
EDIT: and if you want your overlap
column you could compute that with size of lower triangular - n
.