Tags: r, dplyr, data.table

Efficiently Assigning a Common ID to Related Rows in a Large DataFrame in R


I am working with a large dataset in R, consisting of approximately 19 million rows and 81 columns, and I need some guidance on processing it efficiently.

My dataset tracks recurring records by name (actually another ID, but for simplicity let's call it name here), alongside their value, start, and end dates; some records were cancelled after a few years. Here is a simplified example of my data structure:

name  value  cancelled  start   end
ABC      77         NA   2010  2011
ABC      66         NA   2010  2011
ABC      55         NA   2010  2011
ABC      44         NA   2011  2012
ABC      33       2012   2011  2012
ABC      22         NA   2011  2012
ABC      11         NA   2012  2013
ABC      44         NA   2012  2013
BAA      33         NA   2009  2012
BAA      22         NA   2009  2010
BAA      45         NA   2009  2010
BAA      23       2011   2009  2011
BAA      54         NA   2010  2011
BAA      15         NA   2012  2013
BAA      42         NA   2010  2011

My goal is to assign a unique ID to each sequence of related records, where a sequence is defined by consecutive records sharing the same name. My basic understanding: if a row has no cancellation date, match it (say, row 1 of name ABC) to the next occurring row of ABC whose start date equals row 1's end date. If a record is cancelled, it should not be linked to subsequent occurrences of the same name. The desired outcome would look like this (a new unique ID is created for each association, which allows me to sort first by name, then by unique ID, then by start date):

name  value  cancelled  start   end  new_unique_id
ABC      77         NA   2010  2011              1
ABC      44         NA   2011  2012              1
ABC      11         NA   2012  2013              1
ABC      66         NA   2010  2011              2
ABC      33       2012   2011  2012              2
ABC      55         NA   2010  2011              3
ABC      22         NA   2011  2012              3
ABC      44         NA   2012  2013              3
BAA      33         NA   2009  2012              4
BAA      15         NA   2012  2013              4
BAA      22         NA   2009  2010              5
BAA      54         NA   2010  2011              5
BAA      45         NA   2009  2010              6
BAA      42         NA   2010  2011              6
BAA      23       2011   2009  2011              7

Given the scale of my dataset (a 3 GB FST file, ~17 million rows, 81 columns), a traditional loop in R is proving to be inefficient. I am looking for a way to leverage vectorized operations, dplyr, or any other approach you can think of to accomplish this task more efficiently.
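
A minimal sketch of my load step with the fst package ("data.fst" stands in for the actual file path); restricting the columns argument keeps memory down, since only five of the 81 columns are needed for the grouping itself:

library(fst)

# Read only the columns the grouping logic needs;
# "data.fst" is a placeholder for the real file path.
dat <- read_fst(
  "data.fst",
  columns = c("name", "value", "cancelled", "start", "end"),
  as.data.table = TRUE
)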

Any insights or suggestions on how to approach this problem would be greatly appreciated!

edit: adding the R code for table 1 (before):

table_before <- data.frame(
  name = c("ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA"),
  value = c(77, 66, 55, 44, 33, 22, 11, 44, 33, 22, 45, 23, 54, 15, 42),
  cancelled = c(NA, NA, NA, NA, 2012, NA, NA, NA, NA, NA, NA, 2011, NA, NA, NA),
  start = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2009, 2009, 2009, 2009, 2010, 2012, 2010),
  end = c(2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2012, 2010, 2010, 2011, 2011, 2013, 2011)
)

# The following is the desired state: as you can see, the rows are sorted first by name, then by a unique ID that should be generated and that groups (associates) the related rows together.

table_after <- data.frame(
  name = c("ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA"),
  value = c(77, 44, 11, 66, 33, 55, 22, 44, 33, 15, 22, 54, 45, 42, 23),
  cancelled = c(NA, NA, NA, NA, 2012, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2011),
  start = c(2010, 2011, 2012, 2010, 2011, 2010, 2011, 2012, 2009, 2012, 2009, 2010, 2009, 2010, 2009),
  end = c(2011, 2012, 2013, 2011, 2012, 2011, 2012, 2013, 2012, 2013, 2010, 2011, 2010, 2011, 2011),
  new_unique_id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7)
)

edit: I have simplified the data frame by first sorting by name, then by year, and shortened it by two rows that may have made it unnecessarily complicated.

edit: here is a depiction of the first-come-first-served logic. Simply put: for each row, check whether a unique ID exists; if not, assign a new ID and link subsequent matches to it on a first-found basis.
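
In pseudocode:

# within one name, with rows ordered by start year:
#   if the current row has no unique ID yet, assign a fresh one
#   if the current row is NOT cancelled:
#     find the FIRST not-yet-linked row whose start equals this row's end
#     and give it the current row's unique ID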



Solution

  • Not certain this will scale awesomely, but we can use this function to align/arrange rows.

    fun <- function(start, end, cnx) {
      # Seed: every row starting at the earliest start year opens its own group.
      group <- replace(rep(NA, length(start)), start == min(start),
                       seq_len(sum(start == min(start))))
      start0 <- start
      for (rn in seq_along(start0)) {
        # First not-yet-consumed row whose start matches this row's end;
        # a cancelled row (cnx) is not allowed to link forward.
        ind <- which(start0 %in% end[rn] & !cnx[rn])
        if (length(ind)) {
          group[ind[1]] <- group[rn]
          start0[ind[1]] <- NA  # consume the row so it cannot be matched again
        }
      }
      group
    }
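
    For illustration, applying the function to just the ABC rows of table_before returns the within-name groups (in the original row order):

    with(subset(table_before, name == "ABC"),
         fun(start, end, !is.na(cancelled)))
    # [1] 1 2 3 1 2 3 1 3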
    

    One caveat: since this links each row to the first of possibly many matches, the grouping is first-come-first-served and may not reproduce your groupings exactly when several rows are interchangeable.

    dplyr

    Using dplyr_1.1.0; if you have an older version, replace each .by = c(..) with a corresponding group_by(..) before the respective dplyr verb (and an ungroup() after it).
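
    For example, on older dplyr the first step below would be spelled with group_by()/ungroup() instead of .by (same result):

    table_before |>
      group_by(name) |>
      mutate(unique_id = fun(start, end, !is.na(cancelled))) |>
      ungroup() |>
      arrange(name, unique_id, start)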

    library(dplyr)
    # Step 1: per-name groups via fun(), then sort into the desired order.
    tmp <- table_before |>
      mutate(.by = name, unique_id = fun(start, end, !is.na(cancelled))) |>
      arrange(name, unique_id, start)
    # Step 2: make IDs globally unique by offsetting each name's IDs with
    # the cumulative maximum ID of all preceding names.
    tmp |>
      summarize(.by = name, prevmax = max(unique_id)) |>
      mutate(prevmax = c(0, cumsum(prevmax)[-n()])) |>
      right_join(tmp, by = "name") |>
      mutate(unique_id = unique_id + prevmax) |>
      select(-prevmax)
    #    name value cancelled start  end unique_id
    # 1   ABC    77        NA  2010 2011         1
    # 2   ABC    44        NA  2011 2012         1
    # 3   ABC    11        NA  2012 2013         1
    # 4   ABC    66        NA  2010 2011         2
    # 5   ABC    33      2012  2011 2012         2
    # 6   ABC    55        NA  2010 2011         3
    # 7   ABC    22        NA  2011 2012         3
    # 8   ABC    44        NA  2012 2013         3
    # 9   BAA    33        NA  2009 2012         4
    # 10  BAA    15        NA  2012 2013         4
    # 11  BAA    22        NA  2009 2010         5
    # 12  BAA    54        NA  2010 2011         5
    # 13  BAA    45        NA  2009 2010         6
    # 14  BAA    42        NA  2010 2011         6
    # 15  BAA    23      2011  2009 2011         7
    

    data.table

    library(data.table)
    # Same two steps in data.table: per-name groups, then a global offset
    # joined back on name. The trailing [] forces printing after `:=`.
    DT <- as.data.table(table_before)[, unique_id := fun(start, end, !is.na(cancelled)), by = "name"] |>
      setorder(name, unique_id, start)
    DT[, .(prevmax = max(unique_id)), by = "name"
      ][, prevmax := c(0, cumsum(prevmax)[-.N])
      ][DT, on = "name"
      ][, unique_id := unique_id + prevmax][]
    #       name prevmax value cancelled start   end unique_id
    #     <char>   <num> <num>     <num> <num> <num>     <num>
    #  1:    ABC       0    77        NA  2010  2011         1
    #  2:    ABC       0    44        NA  2011  2012         1
    #  3:    ABC       0    11        NA  2012  2013         1
    #  4:    ABC       0    66        NA  2010  2011         2
    #  5:    ABC       0    33      2012  2011  2012         2
    #  6:    ABC       0    55        NA  2010  2011         3
    #  7:    ABC       0    22        NA  2011  2012         3
    #  8:    ABC       0    44        NA  2012  2013         3
    #  9:    BAA       3    33        NA  2009  2012         4
    # 10:    BAA       3    15        NA  2012  2013         4
    # 11:    BAA       3    22        NA  2009  2010         5
    # 12:    BAA       3    54        NA  2010  2011         5
    # 13:    BAA       3    45        NA  2009  2010         6
    # 14:    BAA       3    42        NA  2010  2011         6
    # 15:    BAA       3    23      2011  2009  2011         7
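
    As a quick sanity check on the sample data, assign the final pipe above to a variable (res below is a name introduced here, not part of the original chain) and compare with table_after; on these 15 rows the first-come-first-served groups coincide with the desired grouping:

    res <- DT[, .(prevmax = max(unique_id)), by = "name"
      ][, prevmax := c(0, cumsum(prevmax)[-.N])
      ][DT, on = "name"
      ][, unique_id := unique_id + prevmax]
    res_df <- as.data.frame(res)[, c("name", "value", "cancelled", "start", "end", "unique_id")]
    names(res_df)[6] <- "new_unique_id"
    all.equal(res_df, table_after, check.attributes = FALSE)
    # should return TRUE for this sample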