Search code examples
rduplicatesdata-cleaning

R dedupe records that are not exactly duplicates


I have a list of record that I need to dedup, these look like a combination of the same set of, but using the regular functions to deduplicate records does not work because the two columns are not duplicates. Below is a reproducible example.

df <- data.frame( A  =  c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),

          B =  c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))

Below is the desired output that I'm looking for.

df_Final <- data.frame( A  =  c("2","2","2","331","391","481"),

          B =  c("43","501","502","491","496","490"))

Solution

  • I guess the idea is that you want to find when the elements in column A first appear in column B

    idx = match(df$A, df$B)
    

    and keep the row if the element in A isn't in B (is.na(idx)) or the element in A occurs before it's first occurrence in B (seq_along(idx) < idx)

    df[is.na(idx) | seq_along(idx) < idx,]
    

    Maybe a more-or-less literal tidyverse approach to this would be to create and then drop a temporary column

    library(tidyverse)
    df %>% mutate(idx = match(A, B)) %>%
        filter(is.na(idx) | seq_along(idx) < idx) %>%
        select(-idx)