Search code examples
rstring-comparison

How to compare two data frames line by line?


About a month ago, I posted the original question where I was needing to compare two data frames line by line and label the lines from df2 (the second file) that do not match df1 (the first file). The solution was to use anti join. This worked great until I added an additional column with text strings. I need to also include that column in the comparison and detect which records of the text string do not match.

Attached is the example data frames. I need to compare df2 against df1 and show which lines in df2 to not match df1. I am able to use the anti join in R to show which lines do not match, but when I have text strings in the lines it does not work.

df1

Product basecode    A   B   C   D   E   F
Tractor A810     382    512 363 553 530 A dog ran fast
Tractor A773     222    155 650 278 215 A dog ran fast
Tractor A203     382    512 363 553 530 A dog ran fast
Tractor A329     332    459 251 341 475 A dog ran fast
Combine B244     244    714 467 122 340 A dog ran fast
Combine B302     257    758 230 704 715 A dog ran fast
Combine B681     670    626 572 795 323 A dog ran fast
Combine B514     768    510 546 542 582 A dog ran fast
Sprayer C850     553    624 557 660 337 A dog ran fast
Sprayer C202     561    733 443 107 526 A dog ran fast
Sprayer C619     256    226 257 770 633 A dog ran fast
Sprayer C292     256    226 257 770 633 A dog ran fast
SPFH    D126     323    597 647 159 317 A dog ran fast
SPFH    D307     711    535 323 793 769 A dog ran fast
SPFH    D355     155    744 772 689 509 A dog ran fast
SPFH    D893     155    744 772 689 509 A dog ran fast

df2

Product basecode    A   B   C   D   E   F
Tractor A810     382    512 363 553 530 A dog ran fast
Tractor A773     222    155 650 278 215 A dog ran fast
Tractor A203     382    512 363 553 530 A dog ran fast
Tractor A329     332    459 251 341 475 A dog ran fast
Combine B 244    244    714 467 122 340 A dog ran fast
Combine B302     257    758 230 704 715 A dog ran fast
Combine B681     670    626 572 795 323 A dog ran fast
Combine B514     768    510 546 542 582 A dog ran fast
Sprayer C850     553    624 557 660 337 A dog ran fast
Sprayer C202     561    733 443 107 526 A dog ran fast
Sprayer C619     256    226 257 770 633 A dog ran fast
Sprayer C292     1  1   1   1   1   A dog ran fast
SPFH    D126     323    597 647 159 317 A dog ran fast
SPFH    D307     711    535 323 793 769 A dog ran fast
SPFH    D355     155    744 772 689 509 A dog ran fast
SPFH    D893     1  1   1   1   1   A dog ran fast
Tractor A810     491    765 457 249 641 A dog ran fast
Tractor A773     222    155 650 278 215 A dog ran fast
Tractor A203     382    512 363 553 530 A dog ran fast
Tractor A329     332    459 251 341 475 A dog ran fast
Combine B 244    244    714 467 122 340 A dog ran fast
Combine B302     257    758 230 704 715 A cat ran slow
Combine B681     670    626 572 795 323 cat
Combine B514     768    510 546 542 582 A dog ran fast
Sprayer C850     553    624 557 660 337 A dog ran fast
Sprayer C202     561    733 443 107 526 A dog ran fast
Sprayer C619     256    226 257 770 633 A dog ran fast

Code

# add id to identify which rows are not matching
df2 <- df2 %>% mutate(id = basecode)

df_unmatch <- anti_join(df2, df1)

# list of non-match are the ids of df_unmatch
df_unmatch$id

Data

#structure(list(Product = c("Tractor", "Tractor", "Tractor", "Tractor", 
"Combine", "Combine", "Combine", "Combine", "Sprayer", "Sprayer", 
"Sprayer", "Sprayer", "SPFH", "SPFH", "SPFH", "SPFH", "Tractor", 
"Tractor", "Tractor", "Tractor", "Combine", "Combine", "Combine", 
"Combine", "Sprayer", "Sprayer", "Sprayer"), basecode = c("A810", 
"A773", "A203", "A329", "B 244", "B302", "B681", "B514", "C850", 
"C202", "C619", "C292", "D126", "D307", "D355", "D893", "A810", 
"A773", "A203", "A329", "B 244", "B302", "B681", "B514", "C850", 
"C202", "C619"), A = c(382, 222, 382, 332, 244, 257, 670, 768, 
553, 561, 256, 1, 323, 711, 155, 1, 491, 222, 382, 332, 244, 
257, 670, 768, 553, 561, 256), B = c(512, 155, 512, 459, 714, 
758, 626, 510, 624, 733, 226, 1, 597, 535, 744, 1, 765, 155, 
512, 459, 714, 758, 626, 510, 624, 733, 226), C = c(363, 650, 
363, 251, 467, 230, 572, 546, 557, 443, 257, 1, 647, 323, 772, 
1, 457, 650, 363, 251, 467, 230, 572, 546, 557, 443, 257), D = c(553, 
278, 553, 341, 122, 704, 795, 542, 660, 107, 770, 1, 159, 793, 
689, 1, 249, 278, 553, 341, 122, 704, 795, 542, 660, 107, 770
), E = c(530, 215, 530, 475, 340, 715, 323, 582, 337, 526, 633, 
1, 317, 769, 509, 1, 641, 215, 530, 475, 340, 715, 323, 582, 
337, 526, 633), F = c("A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "cat", 
"A dog ran fast", "A dog ran fast", "A dog ran fast", "A dog ran fast"
)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"
))

Solution

  • It does work, unless you have some particular expectation (see comment by Limey). The two files you provided are actually identical (see comment by MonJeanJean), so let's start by creating mismatching lines:

    df1$F <- "A dog ran faster" ## df2 has "cat" somewhere
    df2$A[16] <- 155
    anti_join(df2, df1)
    
    # A tibble: 2 x 8
      Product basecode     A     B     C     D     E F             
      <chr>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <chr>         
    1 SPFH    D893         1     1     1     1     1 A dog ran fast
    2 Combine B681       670   626   572   795   323 cat           
    

    What result did you expect?