Tags: r, dplyr, data.table

Efficiently Assigning a Common ID to Related Rows in a Large DataFrame in R


I am working with a large dataset in R, consisting of approximately 19 million rows and 81 columns, and I need some guidance on processing it efficiently.

My dataset tracks recurring records by name (actually another ID, but for simplicity let's call it name here), alongside their value, start, and end dates; some records were cancelled after a few years. Here is a simplified example of my data structure:

name  value  cancelled  start   end
ABC      77         NA   2010  2011
ABC      66         NA   2010  2011
ABC      55         NA   2010  2011
ABC      44         NA   2011  2012
ABC      33       2012   2011  2012
ABC      22         NA   2011  2012
ABC      11         NA   2012  2013
ABC      44         NA   2012  2013
BAA      33         NA   2009  2012
BAA      22         NA   2009  2010
BAA      45         NA   2009  2010
BAA      23       2011   2009  2011
BAA      54         NA   2010  2011
BAA      15         NA   2012  2013
BAA      42         NA   2010  2011

My goal is to assign a unique ID to each sequence of related records, where a sequence is defined by consecutive records sharing the same name. My basic understanding: if a row has no cancellation date, match it (say, row 1 of name ABC) to the next occurring row of ABC whose start date equals row 1's end date. If a record is cancelled, it should not be linked to subsequent occurrences of the same name. The desired outcome would look like this (a new unique ID is created for each association, which allows me to sort first by name, then by unique ID, then by start date):

name  value  cancelled  start   end  new_unique_id
ABC      77         NA   2010  2011              1
ABC      44         NA   2011  2012              1
ABC      11         NA   2012  2013              1
ABC      66         NA   2010  2011              2
ABC      33       2012   2011  2012              2
ABC      55         NA   2010  2011              3
ABC      22         NA   2011  2012              3
ABC      44         NA   2012  2013              3
BAA      33         NA   2009  2012              4
BAA      15         NA   2012  2013              4
BAA      22         NA   2009  2010              5
BAA      54         NA   2010  2011              5
BAA      45         NA   2009  2010              6
BAA      42         NA   2010  2011              6
BAA      23       2011   2009  2011              7

Given the scale of my dataset (a 3 GB FST file, ~17 million rows, 81 columns), a traditional loop in R is proving to be inefficient. I am looking for a way to leverage vectorized operations, dplyr, or any other approach you can think of to accomplish this task more efficiently.
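
A minimal sketch of my load step with the fst package ("data.fst" stands in for the actual file path); restricting the columns argument keeps memory down, since only five of the 81 columns are needed for the grouping itself:

library(fst)

# Read only the columns the grouping logic needs;
# "data.fst" is a placeholder for the real file path.
dat <- read_fst(
  "data.fst",
  columns = c("name", "value", "cancelled", "start", "end"),
  as.data.table = TRUE
)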

Any insights or suggestions on how to approach this problem would be greatly appreciated!

edit: adding the R code for table 1 (before):

table_before <- data.frame(
  name = c("ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA"),
  value = c(77, 66, 55, 44, 33, 22, 11, 44, 33, 22, 45, 23, 54, 15, 42),
  cancelled = c(NA, NA, NA, NA, 2012, NA, NA, NA, NA, NA, NA, 2011, NA, NA, NA),
  start = c(2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2009, 2009, 2009, 2009, 2010, 2012, 2010),
  end = c(2011, 2011, 2011, 2012, 2012, 2012, 2013, 2013, 2012, 2010, 2010, 2011, 2011, 2013, 2011)
)

# The following is the desired state: as you can see, the rows are sorted first by name, then by a unique ID that should be generated and that groups (associates) the related rows together.

table_after <- data.frame(
  name = c("ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA", "BAA"),
  value = c(77, 44, 11, 66, 33, 55, 22, 44, 33, 15, 22, 54, 45, 42, 23),
  cancelled = c(NA, NA, NA, NA, 2012, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2011),
  start = c(2010, 2011, 2012, 2010, 2011, 2010, 2011, 2012, 2009, 2012, 2009, 2010, 2009, 2010, 2009),
  end = c(2011, 2012, 2013, 2011, 2012, 2011, 2012, 2013, 2012, 2013, 2010, 2011, 2010, 2011, 2011),
  new_unique_id = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7)
)

edit: I have simplified the data frame by first sorting by name, then by year, and shortened it by two rows that may have made it unnecessarily complicated.

edit: here is a depiction of the first-come-first-served logic. Simply put: for each row, check whether a unique ID exists; if not, assign a new ID and link subsequent matches to it on a first-found basis.
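
In pseudocode:

# within one name, with rows ordered by start year:
#   if the current row has no unique ID yet, assign a fresh one
#   if the current row is NOT cancelled:
#     find the FIRST not-yet-linked row whose start equals this row's end
#     and give it the current row's unique ID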



Solution

  • Not certain this will scale awesomely, but we can use this function to align/arrange rows.

    fun <- function(start, end, cnx) {
      # Seed: every row starting at the earliest start year opens its own group.
      group <- replace(rep(NA, length(start)), start == min(start),
                       seq_len(sum(start == min(start))))
      start0 <- start
      for (rn in seq_along(start0)) {
        # First not-yet-consumed row whose start matches this row's end;
        # a cancelled row (cnx) is not allowed to link forward.
        ind <- which(start0 %in% end[rn] & !cnx[rn])
        if (length(ind)) {
          group[ind[1]] <- group[rn]
          start0[ind[1]] <- NA  # consume the row so it cannot be matched again
        }
      }
      group
    }
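
    For illustration, applying the function to just the ABC rows of table_before returns the within-name groups (in the original row order):

    with(subset(table_before, name == "ABC"),
         fun(start, end, !is.na(cancelled)))
    # [1] 1 2 3 1 2 3 1 3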
    

    One caveat: since this links each row to the first of possibly many matches, the grouping is first-come-first-served and may not reproduce your groupings exactly when several rows are interchangeable.

    dplyr

    Using dplyr_1.1.0; if you have an older version, replace each .by = c(..) with a corresponding group_by(..) before the respective dplyr verb (and an ungroup() after it).
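
    For example, on older dplyr the first step below would be spelled with group_by()/ungroup() instead of .by (same result):

    table_before |>
      group_by(name) |>
      mutate(unique_id = fun(start, end, !is.na(cancelled))) |>
      ungroup() |>
      arrange(name, unique_id, start)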

    library(dplyr)
    # Step 1: per-name groups via fun(), then sort into the desired order.
    tmp <- table_before |>
      mutate(.by = name, unique_id = fun(start, end, !is.na(cancelled))) |>
      arrange(name, unique_id, start)
    # Step 2: make IDs globally unique by offsetting each name's IDs with
    # the cumulative maximum ID of all preceding names.
    tmp |>
      summarize(.by = name, prevmax = max(unique_id)) |>
      mutate(prevmax = c(0, cumsum(prevmax)[-n()])) |>
      right_join(tmp, by = "name") |>
      mutate(unique_id = unique_id + prevmax) |>
      select(-prevmax)
    #    name value cancelled start  end unique_id
    # 1   ABC    77        NA  2010 2011         1
    # 2   ABC    44        NA  2011 2012         1
    # 3   ABC    11        NA  2012 2013         1
    # 4   ABC    66        NA  2010 2011         2
    # 5   ABC    33      2012  2011 2012         2
    # 6   ABC    55        NA  2010 2011         3
    # 7   ABC    22        NA  2011 2012         3
    # 8   ABC    44        NA  2012 2013         3
    # 9   BAA    33        NA  2009 2012         4
    # 10  BAA    15        NA  2012 2013         4
    # 11  BAA    22        NA  2009 2010         5
    # 12  BAA    54        NA  2010 2011         5
    # 13  BAA    45        NA  2009 2010         6
    # 14  BAA    42        NA  2010 2011         6
    # 15  BAA    23      2011  2009 2011         7
    

    data.table

    library(data.table)
    # Same two steps in data.table: per-name groups, then a global offset
    # joined back on name. The trailing [] forces printing after `:=`.
    DT <- as.data.table(table_before)[, unique_id := fun(start, end, !is.na(cancelled)), by = "name"] |>
      setorder(name, unique_id, start)
    DT[, .(prevmax = max(unique_id)), by = "name"
      ][, prevmax := c(0, cumsum(prevmax)[-.N])
      ][DT, on = "name"
      ][, unique_id := unique_id + prevmax][]
    #       name prevmax value cancelled start   end unique_id
    #     <char>   <num> <num>     <num> <num> <num>     <num>
    #  1:    ABC       0    77        NA  2010  2011         1
    #  2:    ABC       0    44        NA  2011  2012         1
    #  3:    ABC       0    11        NA  2012  2013         1
    #  4:    ABC       0    66        NA  2010  2011         2
    #  5:    ABC       0    33      2012  2011  2012         2
    #  6:    ABC       0    55        NA  2010  2011         3
    #  7:    ABC       0    22        NA  2011  2012         3
    #  8:    ABC       0    44        NA  2012  2013         3
    #  9:    BAA       3    33        NA  2009  2012         4
    # 10:    BAA       3    15        NA  2012  2013         4
    # 11:    BAA       3    22        NA  2009  2010         5
    # 12:    BAA       3    54        NA  2010  2011         5
    # 13:    BAA       3    45        NA  2009  2010         6
    # 14:    BAA       3    42        NA  2010  2011         6
    # 15:    BAA       3    23      2011  2009  2011         7
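
    As a quick sanity check on the sample data, assign the final pipe above to a variable (res below is a name introduced here, not part of the original chain) and compare with table_after; on these 15 rows the first-come-first-served groups coincide with the desired grouping:

    res <- DT[, .(prevmax = max(unique_id)), by = "name"
      ][, prevmax := c(0, cumsum(prevmax)[-.N])
      ][DT, on = "name"
      ][, unique_id := unique_id + prevmax]
    res_df <- as.data.frame(res)[, c("name", "value", "cancelled", "start", "end", "unique_id")]
    names(res_df)[6] <- "new_unique_id"
    all.equal(res_df, table_after, check.attributes = FALSE)
    # should return TRUE for this sample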