I am working with a large dataset in R (approximately 19 million rows and 81 columns) and need some guidance on processing it efficiently.
My dataset tracks recurring records by name (actually another ID, but for simplicity let's say name here), along with their value, start, and end dates; some records were cancelled after a few years. Here is a simplified example of my data structure:
name | value | cancelled | start | end |
---|---|---|---|---|
ABC | 77 | | 2010 | 2011 |
ABC | 66 | | 2010 | 2011 |
ABC | 55 | | 2010 | 2011 |
ABC | 44 | | 2011 | 2012 |
ABC | 33 | 2012 | 2011 | 2012 |
ABC | 22 | | 2011 | 2012 |
ABC | 11 | | 2012 | 2013 |
ABC | 44 | | 2012 | 2013 |
BAA | 33 | | 2009 | 2012 |
BAA | 22 | | 2009 | 2010 |
BAA | 45 | | 2009 | 2010 |
BAA | 23 | 2011 | 2009 | 2011 |
BAA | 54 | | 2010 | 2011 |
BAA | 15 | | 2012 | 2013 |
BAA | 42 | | 2010 | 2011 |
My goal is to assign a unique ID to each sequence of related records, where a sequence is defined by consecutive records sharing the same name. My basic understanding: if a row has no cancellation date, match row 1 of name ABC to the next occurring row 2 of ABC whose start date equals the end date of row 1. If a record is cancelled, it must not be linked to subsequent occurrences of the same name. The desired outcome would look like this (a new unique ID that associates the rows and lets me sort first by name, then by unique ID, then by start date):
name | value | cancelled | start | end | new_unique_id |
---|---|---|---|---|---|
ABC | 77 | | 2010 | 2011 | 1 |
ABC | 44 | | 2011 | 2012 | 1 |
ABC | 11 | | 2012 | 2013 | 1 |
ABC | 66 | | 2010 | 2011 | 2 |
ABC | 33 | 2012 | 2011 | 2012 | 2 |
ABC | 55 | | 2010 | 2011 | 3 |
ABC | 22 | | 2011 | 2012 | 3 |
ABC | 44 | | 2012 | 2013 | 3 |
BAA | 33 | | 2009 | 2012 | 4 |
BAA | 15 | | 2012 | 2013 | 4 |
BAA | 22 | | 2009 | 2010 | 5 |
BAA | 54 | | 2010 | 2011 | 5 |
BAA | 45 | | 2009 | 2010 | 6 |
BAA | 42 | | 2010 | 2011 | 6 |
BAA | 23 | 2011 | 2009 | 2011 | 7 |
Given the scale of my dataset (a 3 GB FST file, ~17 million rows, 81 columns), a traditional loop in R is proving inefficient. I am looking for a way to leverage vectorized operations, dplyr, or any other approach you can think of to accomplish this task more efficiently.
Any insights or suggestions on how to approach this problem would be greatly appreciated!
edit: adding the R code for table 1 (before):
table_before <- data.frame(
  name = c("ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA"),
  value = c(77, 66, 55, 44, 33, 22, 11, 44, 33, 22, 45, 23, 54, 15, 42),
  cancelled = c(NA, NA, NA, NA, 2012, NA, NA, NA, NA, NA, NA, 2011, NA, NA, NA),
  start = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2009, 2009, 2009, 2009, 2010, 2012, 2010),
  end = c(2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2012, 2010, 2010, 2011, 2011, 2013, 2011)
)
# The following is the desired state: rows sorted first by name, then by a
# unique ID that should be generated to group (associate) the rows
table_after <- data.frame(
  name = c("ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA"),
  value = c(77, 44, 11, 66, 33, 55, 22, 44, 33, 15, 22, 54, 45, 42, 23),
  cancelled = c(NA, NA, NA, NA, 2012, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2011),
  start = c(2010, 2011, 2012, 2010, 2011, 2010, 2011, 2012, 2009, 2012, 2009, 2010, 2009, 2010, 2009),
  end = c(2011, 2012, 2013, 2011, 2012, 2011, 2012, 2013, 2012, 2013, 2010, 2011, 2010, 2011, 2011),
  new_unique_id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7)
)
edit: I have simplified the data frame by first sorting by name, then by year, and removed two rows that made it unnecessarily complicated.
edit: here is a description of the first-come-first-serve logic. Simply put: for each row, check whether a unique ID already exists; if not, assign a new ID, then link subsequent matches based on the first find.
Not certain this will scale awesomely, but we can use this function to align/arrange rows.
fun <- function(start, end, cnx) {
  # seed: every row that begins at the earliest start date opens its own group
  group <- replace(rep(NA, length(start)), start == min(start),
                   seq_len(sum(start == min(start))))
  start0 <- start
  for (rn in seq_along(start0)) {
    # first not-yet-claimed row whose start matches this row's end;
    # a cancelled row (cnx[rn] is TRUE) links to nothing
    ind <- which(start0 %in% end[rn] & !cnx[rn])
    if (length(ind)) {
      group[ind[1]] <- group[rn]  # inherit the group id
      start0[ind[1]] <- NA        # mark the row as claimed
    }
  }
  group
}
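To illustrate what `fun` produces before any renumbering across names, here it is on just the eight ABC rows of `table_before` (base R only; `fun` is copied from above so the snippet runs standalone). The cancelled row still receives a group but links to nothing after it:

```r
# copy of fun() from above so this snippet runs standalone
fun <- function(start, end, cnx) {
  group <- replace(rep(NA, length(start)), start == min(start),
                   seq_len(sum(start == min(start))))
  start0 <- start
  for (rn in seq_along(start0)) {
    ind <- which(start0 %in% end[rn] & !cnx[rn])
    if (length(ind)) {
      group[ind[1]] <- group[rn]
      start0[ind[1]] <- NA
    }
  }
  group
}

# the eight ABC rows of table_before, in their original order
start <- c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012)
end   <- c(2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013)
cnx   <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)  # row 5 (value 33) is cancelled

fun(start, end, cnx)
# [1] 1 2 3 1 2 3 1 3
```

Row 5 is cancelled, so at its turn no follow-up row is claimed, which is why its group 2 ends there while groups 1 and 3 continue into 2012-2013.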
One caveat is that since this finds the first (of possibly many) match, it is not exactly the groupings that you have in your data.
Using dplyr_1.1.0; if you have an older version, replace each .by=c(..) with a corresponding group_by(..) before the respective dplyr verb (and ungroup() after).
library(dplyr)
tmp <- table_before |>
  mutate(.by = name, unique_id = fun(start, end, !is.na(cancelled))) |>
  arrange(name, unique_id, start)
tmp |>
  summarize(.by = name, prevmax = max(unique_id)) |>
  mutate(prevmax = c(0, cumsum(prevmax)[-n()])) |>
  right_join(tmp, by = "name") |>
  mutate(unique_id = unique_id + prevmax) |>
  select(-prevmax)
#    name value cancelled start  end unique_id
# 1   ABC    77        NA  2010 2011         1
# 2   ABC    44        NA  2011 2012         1
# 3   ABC    11        NA  2012 2013         1
# 4   ABC    66        NA  2010 2011         2
# 5   ABC    33      2012  2011 2012         2
# 6   ABC    55        NA  2010 2011         3
# 7   ABC    22        NA  2011 2012         3
# 8   ABC    44        NA  2012 2013         3
# 9   BAA    33        NA  2009 2012         4
# 10  BAA    15        NA  2012 2013         4
# 11  BAA    22        NA  2009 2010         5
# 12  BAA    54        NA  2010 2011         5
# 13  BAA    45        NA  2009 2010         6
# 14  BAA    42        NA  2010 2011         6
# 15  BAA    23      2011  2009 2011         7
library(data.table)
DT <- as.data.table(table_before)[, unique_id := fun(start, end, !is.na(cancelled)), by = "name"] |>
setorder(name, unique_id, start)
DT[, .(prevmax = max(unique_id)), by = "name"
][, prevmax := c(0, cumsum(prevmax)[-.N])
][DT, on = "name"
][, unique_id := unique_id + prevmax]
#       name prevmax value cancelled start   end unique_id
#     <char>   <num> <num>     <num> <num> <num>     <num>
#  1:    ABC       0    77        NA  2010  2011         1
#  2:    ABC       0    44        NA  2011  2012         1
#  3:    ABC       0    11        NA  2012  2013         1
#  4:    ABC       0    66        NA  2010  2011         2
#  5:    ABC       0    33      2012  2011  2012         2
#  6:    ABC       0    55        NA  2010  2011         3
#  7:    ABC       0    22        NA  2011  2012         3
#  8:    ABC       0    44        NA  2012  2013         3
#  9:    BAA       3    33        NA  2009  2012         4
# 10:    BAA       3    15        NA  2012  2013         4
# 11:    BAA       3    22        NA  2009  2010         5
# 12:    BAA       3    54        NA  2010  2011         5
# 13:    BAA       3    45        NA  2009  2010         6
# 14:    BAA       3    42        NA  2010  2011         6
# 15:    BAA       3    23      2011  2009  2011         7
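On the real data, one practical lever is that FST supports column subsetting on read: pull only the five columns the grouping needs instead of all 81, compute the IDs, and bind them back if required. A sketch under assumptions: the file name records.fst is hypothetical, the toy write_fst() call only exists to make the snippet runnable end to end (your 3 GB file already exists), and fun() is copied from above:

```r
library(fst)
library(data.table)

# copy of fun() from above so this snippet runs standalone
fun <- function(start, end, cnx) {
  group <- replace(rep(NA, length(start)), start == min(start),
                   seq_len(sum(start == min(start))))
  start0 <- start
  for (rn in seq_along(start0)) {
    ind <- which(start0 %in% end[rn] & !cnx[rn])
    if (length(ind)) {
      group[ind[1]] <- group[rn]
      start0[ind[1]] <- NA
    }
  }
  group
}

# demo only: write a toy table so read_fst() below has something to read;
# in practice the large FST file already exists
write_fst(data.frame(name  = c("ABC", "ABC", "ABC"),
                     value = c(77, 66, 44),
                     cancelled = c(NA, NA, NA),
                     start = c(2010, 2010, 2011),
                     end   = c(2011, 2011, 2012)),
          "records.fst")

# read only the columns the grouping needs, not all 81
DT <- read_fst("records.fst",
               columns = c("name", "value", "cancelled", "start", "end"),
               as.data.table = TRUE)

DT[, unique_id := fun(start, end, !is.na(cancelled)), by = name]
setorder(DT, name, unique_id, start)
```

Reading a narrow slice keeps the working set small; the remaining 76 columns can be joined back later on a row key if you add one before writing the FST file.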