Search code examples
rdplyr

Finding the most recent update within a table


library(dplyr)


my_tibble <- tibble(
  parentId = character(),
  origId = character(),
  id = character(),
  value = double()
)

original_tibble <- my_tibble %>%
  add_row(parentId = "2024Z123", origId = NA, id = "2024Z123", value = 2) %>%
  add_row(parentId = "2024Z123", origId = "2024Z123", id = "2024A652", value = 6) %>%
  add_row(parentId = "2024A652", origId = "2024Z123", id = "2024X156", value = 5) %>%
  add_row(parentId = "2020G001", origId = NA, id = "2020G001", value = 1) %>%
  add_row(parentId = "2020G001", origId = "2020G001", id = "2024X156", value = 2) %>%
  add_row(parentId = "2021B456", origId = NA, id = "2021B456", value = 3)

final_tibble <- my_tibble %>%
  add_row(parentId = "2024A652", origId = "2024Z123", id = "2024X156", value = 5) %>%
  add_row(parentId = "2020G001", origId = "2020G001", id = "2024Z156", value = 2) %>%
  add_row(parentId = "2021B456", origId = NA, id = "2021B456", value = 3)

The original_tibble contains 3 chains. The way the chain evolves is the first entry has parentId = id, and origId = NA and a value. The id entries are characters and the values are ints. Then for each update to the value a new row with the parentId = most recent update row id, origId = the id of the first entry, and id = unique entry for this new row, and the new value entry. The chains also aren't necessarily ordered within the rows in the way I have built the table above, so can't rely on that.

The final_tibble is the result I'm looking for. Trying to do this in dplyr but any solution is fine. What I want is to keep the most recent row entry and eliminate the previous entries from the chain.


Solution

  • First completing a grouping variable grp by combining parentId and origId, then looking for an id that is unique, which is automatically the most recent one.

    library(dplyr)
    
    original_tibble %>%
      group_by(grp = if_else(is.na(origId), parentId, origId)) %>%
      filter(!id %in% parentId | n() == 1) %>%
      ungroup() %>%
      select(-grp)
    # A tibble: 3 × 4
      parentId origId   id       value
      <chr>    <chr>    <chr>    <dbl>
    1 2024A652 2024Z123 2024X156     5
    2 2020G001 2020G001 2024X156     2
    3 2021B456 NA       2021B456     3
    

    Note, in your case you can replace if_else(...) with coalesce(origId, parentId)