Tags: r, fuzzy-search

Fuzzy merging in R


How can I join two records when their descriptions are worded differently but refer to the same product?

1. Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season
2. 195/75 R16C lid CORDIANT Business CA

These are the same product, because the article number matches: 195/75 R16C

Another example:

1. 185/75 R16C lid Forward Professional 156 ASHK tubeless
2. The tire `185/75 R16С` С-156

Here the shared article number is 185/75 R16C
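Since both examples share a size code, one possible approach is to pull that code out of each description and join on it exactly. The following is only a sketch in base R: `extract_size_code()` is a hypothetical helper, the regular expression assumes codes shaped like `195/75 R16C`, and it assumes that the "С" in the second example may be the Cyrillic letter (U+0421) rather than the Latin "C".

```r
# Hypothetical helper (not from any package): extract a tire size code such as
# "195/75 R16C" from a free-text product name and return it without spaces.
extract_size_code <- function(x) {
  # Replace the Cyrillic capital letter (U+0421) with the look-alike Latin "C".
  x <- gsub("\u0421", "C", x, fixed = TRUE)
  # Width, "/", profile, "R", rim diameter, optional "C"; single spaces allowed.
  pattern <- "[0-9]{3} ?/ ?[0-9]{2} ?R ?[0-9]{2}C?"
  hit <- regexpr(pattern, x, ignore.case = TRUE)
  out <- rep(NA_character_, length(x))  # NA where no code is found
  out[hit > 0] <- toupper(gsub(" ", "", regmatches(x, hit), fixed = TRUE))
  out
}

df_a <- data.frame(product = c(
  "Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season",
  "185/75 R16C lid Forward Professional 156 ASHK tubeless"
), stringsAsFactors = FALSE)

df_b <- data.frame(product = c(
  "195/75 R16C lid CORDIANT Business CA",
  "The tire 185/75 R16\u0421 \u0421-156"  # "\u0421" is the Cyrillic letter
), stringsAsFactors = FALSE)

df_a$size_code <- extract_size_code(df_a$product)
df_b$size_code <- extract_size_code(df_b$product)

# Exact join on the normalised code; rows without a code are dropped first,
# because merge() would otherwise pair NA keys with each other.
joined <- merge(df_a[!is.na(df_a$size_code), ],
                df_b[!is.na(df_b$size_code), ],
                by = "size_code", suffixes = c("_a", "_b"))
```

A size code alone does not identify the brand or model, so this key is probably better suited to narrowing down candidate pairs than to deciding on its own that two rows are the same product.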

New question about this topic: R: Error in compare.linkage: Data sets have different format


Solution

  • So here is a solution using the RecordLinkage package. I think this does what you want.

    Example data:

    library(tidyverse)
    library(RecordLinkage)
    
    df_01 <- tibble(
      product = c("Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season",
                  "Something else")
    )
    df_02 <- tibble(
      product = c("195/75 R16C lid CORDIANT Business CA", 
                  "Different Product")
    )
    

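    The next step builds every pair of rows from the two data frames and compares the product column using the Jaro-Winkler string similarity; epiWeights() then assigns a weight to each pair. The details are probably best left to the RecordLinkage documentation: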

    rpairs_jar <- compare.linkage(df_01, df_02,
                                  strcmp = c("product"),
                                  strcmpfun = jarowinkler)
    
    rpairs_epiwt <- epiWeights(rpairs_jar)
    
    getPairs(rpairs_epiwt, max.weight = Inf, min.weight = -Inf)
    
       id                                                      product    Weight
    1   1 Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season          
    2   1                         195/75 R16C lid CORDIANT Business CA 0.6135377
    3                                                                           
    4   2                                               Something else          
    5   2                                            Different Product 0.4827264
    6                                                                           
    7   1 Tire 195 / 75R16C Cordiant Business CA 107 / 105R all-season          
    8   2                                            Different Product 0.4586156
    9                                                                           
    10  2                                               Something else          
    11  1                         195/75 R16C lid CORDIANT Business CA 0.4320106
    

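    So this gives each pair of rows a match weight, where a higher weight means a more likely match (it is a similarity score rather than a calibrated probability). As you can see, the pair you want to match gets the highest weight (0.61), ahead of the non-matching pairs (0.43 to 0.48).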
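To turn these weights into actual matches, one option is to keep the best-scoring candidate for each row of the first table and discard it if it falls below a cut-off. The sketch below uses base R on the weights printed above; the `pairs` data frame and the threshold of 0.6 are assumptions chosen for this example, not values recommended by the package.

```r
# Pair weights copied from the getPairs() output
# (id1 = row in df_01, id2 = row in df_02).
pairs <- data.frame(
  id1    = c(1, 2, 1, 2),
  id2    = c(1, 2, 2, 1),
  Weight = c(0.6135377, 0.4827264, 0.4586156, 0.4320106)
)

threshold <- 0.6  # hypothetical cut-off; tune it on your own data

# Sort by descending weight so the first row seen for each id1 is its best
# candidate, then keep only the candidates that reach the threshold.
pairs   <- pairs[order(-pairs$Weight), ]
best    <- pairs[!duplicated(pairs$id1), ]
matches <- best[best$Weight >= threshold, ]

print(matches)
```

With these numbers only the Cordiant pair (id1 = 1, id2 = 1) survives. RecordLinkage also provides epiClassify() for threshold-based classification of the weighted pairs; see its documentation for the exact arguments.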