I have a large dataset of nearly 200,000 rows and 20 columns (a mix of numeric and string data). Each row has a unique identifier. Fewer than 100 rows have duplicate identifiers. I am trying to determine two things, described below.
I've seen several other SO posts, but they usually discuss duplicates by column rather than by row, and/or compare across data frames rather than within one.
Here is a small example of the data. Note the org_id values (a, b, c), of which a and b are duplicated. All values across the columns are identical for org_id a, but not for org_id b.
# load toy data
df <- data.frame(
  org_id = c("a", "a", "b", "b", "b", "c"),
  thing  = c("1", "1", "1", "1", "2", "1"),
  name   = c("really_long_A_name_here", "really_long_A_name_here",
             "really_long_B_name_here", "really_long_B2_name_here",
             "really_long_B_name_here", "really_long_C_name_here"),
  start  = c("2020-10-31", "2020-10-31", "2022-09-17",
             "2022-09-17", "2022-09-17", "2023-05-11")
)
df
org_id | thing | name | start |
---|---|---|---|
a | 1 | really_long_A_name_here | 2020-10-31 |
a | 1 | really_long_A_name_here | 2020-10-31 |
b | 1 | really_long_B_name_here | 2022-09-17 |
b | 1 | really_long_B2_name_here | 2022-09-17 |
b | 2 | really_long_B_name_here | 2022-09-17 |
c | 1 | really_long_C_name_here | 2023-05-11 |
Here is an example of what I need. First, a data frame that tells me which org_id values have exact duplicate rows, such as:
exact_dup | dup_orgs |
---|---|
TRUE | a |
FALSE | b |
So far, I've combined each row's data into a long string for comparison. The code below works but seems clunky. Any suggestions on how to improve this?
library(dplyr)

# create a long string for each row
df$x <- apply(df, 1, paste0, collapse = "|")
# placeholder data frame: which duplicated org_id values are exact duplicates across the entire row?
review_dups <- data.frame(
  exact_dup = NA,
  dup_orgs  = df |> filter(duplicated(org_id)) |> distinct(org_id) |> pull(org_id)
)
# loop to find differences
for (i in 1:nrow(review_dups)) {
  n <- df |> filter(org_id == review_dups$dup_orgs[i]) |> count(x) |> pull(n) |> max()
  dup_rows <- df |> filter(org_id == review_dups$dup_orgs[i]) |> nrow()
  review_dups[i, 1] <- n == dup_rows
  rm(n, dup_rows)
}
rm(i)
# view results
review_dups
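For reference, here's a loop-free sketch of the same check I've been experimenting with. It assumes dplyr is attached (loaded above) and reuses the x row-string column; review_dups2 is just a placeholder name.

# keep only org_id values that repeat, then test whether every
# row-string within the group is identical
review_dups2 <- df |>
  group_by(dup_orgs = org_id) |>
  filter(n() > 1) |>
  summarise(exact_dup = n_distinct(x) == 1)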
Second, for rows with a duplicated org_id, I need a way to report which data columns don't match across those rows.
So the output should tell me that, for org_id b, the thing column and the name column differ.
Preferably, the reported values would appear as a third column on the duplicate-check data frame example above, but I'm open to different reporting options.
I don't yet have a code solution for this part.
Thanks!
I'm inferring that you're only concerned with comparing the 2nd and subsequent rows with the 1st row, not a complete pairwise set of differences.
This adds a column to the original frame giving a comma-separated list of the column names that differ from the first row of each org_id group.
# compare each org_id group's rows against the group's first row
# (note: unlist(by(...)) assumes df's rows are ordered by org_id,
# as they are in the example data)
df$dupe_differences <- unlist(by(df, df$org_id, function(dat) {
  # a lone row has nothing to compare against
  if (nrow(dat) == 1) return(NA)
  # the first row of a dupe-set is its own reference, hence ""
  c("", sapply(2:nrow(dat), function(i) {
    differs <- mapply(Negate(`%in%`), dat[1, ], dat[i, ])  # TRUE where values differ
    paste(names(differs[differs]), collapse = ",")
  }))
}))
df
# org_id thing name start dupe_differences
# 1 a 1 really_long_A_name_here 2020-10-31
# 2 a 1 really_long_A_name_here 2020-10-31
# 3 b 1 really_long_B_name_here 2022-09-17
# 4 b 1 really_long_B2_name_here 2022-09-17 name
# 5 b 2 really_long_B_name_here 2022-09-17 thing
# 6 c 1 really_long_C_name_here 2023-05-11 <NA>
The distinction is unambiguous:

- `NA` means there are no duplicates; that row is unique.
- `""` (empty string) means the row's contents are identical to the first row of its dupe-set.
- A non-empty string lists the columns that differ from the first row of the dupe-set.

From here, you can easily filter out the specific rows you want: `is.na(.)` for no-dupes, `!is.na(.) & !nzchar(.)` for rows identical to the first row (including the first row itself), and `!is.na(.) & nzchar(.)` for rows that are duplicates with differences.
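If you want to roll the per-row results back up into the summary frame shown in the question, here is one sketch (assuming dplyr; diff_cols is just a placeholder name for the third column):

library(dplyr)

review_dups <- df |>
  filter(!is.na(dupe_differences)) |>            # drop rows with no dupes at all
  group_by(dup_orgs = org_id) |>
  summarise(
    exact_dup = all(!nzchar(dupe_differences)),  # every row matched the first row
    diff_cols = paste(unique(dupe_differences[nzchar(dupe_differences)]),
                      collapse = ",")            # columns that ever differed
  )
review_dups
# should give two rows: a TRUE "" and b FALSE "name,thing"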