About a month ago, I posted the original question where I was needing to compare two data frames line by line and label the lines from df2 (the second file) that do not match df1 (the first file). The solution was to use anti join. This worked great until I added an additional column with text strings. I need to also include that column in the comparison and detect which records of the text string do not match.
Attached is the example data frames. I need to compare df2 against df1 and show which lines in df2 to not match df1. I am able to use the anti join in R to show which lines do not match, but when I have text strings in the lines it does not work.
df1
Product basecode A B C D E F
Tractor A810 382 512 363 553 530 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A dog ran fast
Combine B681 670 626 572 795 323 A dog ran fast
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
Sprayer C292 256 226 257 770 633 A dog ran fast
SPFH D126 323 597 647 159 317 A dog ran fast
SPFH D307 711 535 323 793 769 A dog ran fast
SPFH D355 155 744 772 689 509 A dog ran fast
SPFH D893 155 744 772 689 509 A dog ran fast
df2
Product basecode A B C D E F
Tractor A810 382 512 363 553 530 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B 244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A dog ran fast
Combine B681 670 626 572 795 323 A dog ran fast
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
Sprayer C292 1 1 1 1 1 A dog ran fast
SPFH D126 323 597 647 159 317 A dog ran fast
SPFH D307 711 535 323 793 769 A dog ran fast
SPFH D355 155 744 772 689 509 A dog ran fast
SPFH D893 1 1 1 1 1 A dog ran fast
Tractor A810 491 765 457 249 641 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B 244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A cat ran slow
Combine B681 670 626 572 795 323 cat
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
Code
# add id to identify which rows are not matching
df2 <- df2 %>% mutate(id = basecode)
df_unmatch <- anti_join(df2, df1)
# list of non-match are the ids of df_unmatch
df_unmatch$id
Data
#structure(list(Product = c("Tractor", "Tractor", "Tractor", "Tractor",
"Combine", "Combine", "Combine", "Combine", "Sprayer", "Sprayer",
"Sprayer", "Sprayer", "SPFH", "SPFH", "SPFH", "SPFH", "Tractor",
"Tractor", "Tractor", "Tractor", "Combine", "Combine", "Combine",
"Combine", "Sprayer", "Sprayer", "Sprayer"), basecode = c("A810",
"A773", "A203", "A329", "B 244", "B302", "B681", "B514", "C850",
"C202", "C619", "C292", "D126", "D307", "D355", "D893", "A810",
"A773", "A203", "A329", "B 244", "B302", "B681", "B514", "C850",
"C202", "C619"), A = c(382, 222, 382, 332, 244, 257, 670, 768,
553, 561, 256, 1, 323, 711, 155, 1, 491, 222, 382, 332, 244,
257, 670, 768, 553, 561, 256), B = c(512, 155, 512, 459, 714,
758, 626, 510, 624, 733, 226, 1, 597, 535, 744, 1, 765, 155,
512, 459, 714, 758, 626, 510, 624, 733, 226), C = c(363, 650,
363, 251, 467, 230, 572, 546, 557, 443, 257, 1, 647, 323, 772,
1, 457, 650, 363, 251, 467, 230, 572, 546, 557, 443, 257), D = c(553,
278, 553, 341, 122, 704, 795, 542, 660, 107, 770, 1, 159, 793,
689, 1, 249, 278, 553, 341, 122, 704, 795, 542, 660, 107, 770
), E = c(530, 215, 530, 475, 340, 715, 323, 582, 337, 526, 633,
1, 317, 769, 509, 1, 641, 215, 530, 475, 340, 715, 323, 582,
337, 526, 633), F = c("A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "cat",
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast"
)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"
))
It does work, unless you have some particular expectation (see comment by Limey). The two files you provided are actually identical (see comment by MonJeanJean), so let's start by creating mismatching lines:
df1$F <- "A dog ran faster" ## df2 has "cat" somewhere
df2$A[16] <- 155
anti_join(df2, df1)
# A tibble: 2 x 8
Product basecode A B C D E F
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 SPFH D893 1 1 1 1 1 A dog ran fast
2 Combine B681 670 626 572 795 323 cat
What result did you expect?