Search code examples
rdataframerowuniquedifference

Better way to find duplicate entire rows and flag minor differences within single R dataframe?


I have a large dataset of nearly 200,000 rows and 20 columns (mix of numeric and string data). Each row has a unique identifier. Less than 100 rows have duplicate identifiers. I am trying to determine 2 things:

  1. If each row with a duplicate identifier has the exact same duplicate values across all 20 columns or just some columns (min 1 column for the identifier).
  2. For each set of rows with duplicate identifiers that do not have the same values across all columns, identify which columns have different values.

I've seen several other SO posts, but they usually discuss duplicates by columns, not rows and/or compare across data frames instead of within them.

Here is a small data example.
Note there are org_id values (a,b,c), of which a & b are duplicates. All values across the columns are duplicate for org_id a, but not org_id b.

# load toy data
df <- data.frame(org_id=c("a","a","b","b","b","c"),
                 thing=c("1","1","1","1","2","1"),
                 name=c("really_long_A_name_here", "really_long_A_name_here", "really_long_B_name_here", "really_long_B2_name_here", "really_long_B_name_here", "really_long_C_name_here"),
                 start=c("2020-10-31", "2020-10-31", "2022-09-17", "2022-09-17", "2022-09-17", "2023-05-11") )
df
org_id score name start
a 1 really_long_A_name_here 2020-10-31
a 1 really_long_A_name_here 2020-10-31
b 1 really_long_B_name_here 2022-09-17
b 1 really_long_B2_name_here 2022-09-17
b 2 really_long_B_name_here 2022-09-17
c 1 really_long_C_name_here 2023-05-11

Here is an example of what I need: First, another data frame to tell me which org_id have duplicate row values, such as:

exact_dup dup_orgs
TRUE a
FALSE b

So far, I combined row data into a long string for comparison. The code below works but seems clunky. Any suggestions on how to improve this?

# create a long string for each row
df$x <- apply(df, 1, paste0, collapse="|")
# placeholder dataframe to identify which uplicate rows in the data are exact duplicates across the entire row
review_dups <- data.frame(exact_dup = NA, dup_orgs = df |> filter(duplicated(org_id)==TRUE) |> distinct(org_id)|> pull(org_id) )
# loop to find differences
for(i in 1:nrow(review_dups)){
  n <- df |> filter(org_id == review_dups$dup_orgs[i]) |> select(x) |> count(x) |> pull(n) |> max()
  dup_rows <- df |> filter(org_id == review_dups$dup_orgs[i]) |> nrow()
  review_dups[i,1] <- n==dup_rows
  rm(n, dup_rows)
}
rm(i)
# view results
review_dups

Second, I need a way to report those data columns that don't match the rest of the rows for rows with duplicate org_id.
So, the output should tell me the org_id b score column and name column differ.
Preferably, the report values could appear as a third results column on the duplicate check data frame example above, but I'm open to different reporting options.
I don't yet have a code solution for this part.

Thanks!


Solution

  • I'm inferring that you're only concerned with comparing the 2nd and subsequent rows with the 1st row, not a complete pairwise set of differences.

    This adds a column to the original frame that gives a comma-separated list of column names.

    df$dupe_differences <- unlist(by(df, df$org_id, function(dat) {
      if (nrow(dat) == 1) return(NA)
      c("", sapply(2:nrow(dat), function(i) {
        same <- mapply(Negate(`%in%`), dat[1,], dat[i,])
        paste(names(same[same]), collapse = ",")
      }))
    }))
    
    df
    #   org_id thing                     name      start dupe_differences
    # 1      a     1  really_long_A_name_here 2020-10-31                 
    # 2      a     1  really_long_A_name_here 2020-10-31                 
    # 3      b     1  really_long_B_name_here 2022-09-17                 
    # 4      b     1 really_long_B2_name_here 2022-09-17             name
    # 5      b     2  really_long_B_name_here 2022-09-17            thing
    # 6      c     1  really_long_C_name_here 2023-05-11             <NA>
    

    The distinction is unambiguous:

    • NA means there are no duplicates, that row is unique
    • "" (empty string) means that its contents are identical with the first row of that dupe-set
    • anything else lists the column names (comma-separated) that are different from the first row of that dupe-set

    From here, you can easily filter out the specific rows you want using is.na(.) (for no-dupes), !is.na(.) & !nzchar(.) for rows identical to the first row (including the first row), and !is.na(.) & nzchar(.) for rows that are duplicates with differences.