
Best similarity distance metric for two strings


I have a set of company names to match. For example, I want to match the string A&A PRECISION

with A&A PRECISION ENGINEERING

However, almost every similarity measure I have tried (Hamming, Levenshtein, restricted Damerau-Levenshtein, full Damerau-Levenshtein, longest common substring, q-gram, cosine, Jaccard, Jaro, and Jaro-Winkler distance)

matches B&B PRECISION instead.

Is there a metric that puts more weight on how exactly the substrings and their order match, and less on the length of the string? I suspect the difference in length is why these metrics keep choosing the wrong match.


Solution

  • If you really want to "...give more emphasis to the preciseness of the substrings and its sequence...", then the following approach could work, since it tests whether one string is a substring of another:

    library(data.table)
    
    x <- c("A&A PRECISION", "A&A PRECISION ENGINEERING", "B&B PRECISION")
    y <- x
    

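    We want every pairing of the strings, so we expand the grid with the CJ (cross join) function from data.table. Then we check each pair to see whether x is a substring of y. The test is directional: a longer string is never a substring of a shorter one. Note that %like% treats its right-hand side as a regular expression, so names containing regex metacharacters such as . or ( need to be escaped or matched as fixed strings.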

    # Cross join all pairs, flag rows where x occurs inside y, then drop self-pairs
    CJ(x, y)[, similarity := apply(.SD, 1, function(x) x[2] %like% x[1]), .SDcols = c("x", "y")][x != y, ]
                               x                         y similarity
    1:             A&A PRECISION A&A PRECISION ENGINEERING       TRUE
    2:             A&A PRECISION             B&B PRECISION      FALSE
    3: A&A PRECISION ENGINEERING             A&A PRECISION      FALSE
    4: A&A PRECISION ENGINEERING             B&B PRECISION      FALSE
    5:             B&B PRECISION             A&A PRECISION      FALSE
    6:             B&B PRECISION A&A PRECISION ENGINEERING      FALSE
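
    As a hedged alternative, the same directional substring test can be sketched in base R with a literal (non-regex) match, which avoids surprises when a company name contains regex metacharacters. The data below are the three example names from above; the column name similarity simply mirrors the data.table version and is otherwise arbitrary.

    ```r
    # Example data, copied from the answer above
    x <- c("A&A PRECISION", "A&A PRECISION ENGINEERING", "B&B PRECISION")

    # All ordered pairs of distinct names
    pairs <- expand.grid(x = x, y = x, stringsAsFactors = FALSE)
    pairs <- pairs[pairs$x != pairs$y, ]

    # TRUE when x occurs literally inside y; fixed = TRUE turns off regex interpretation
    pairs$similarity <- mapply(grepl, pairs$x, pairs$y,
                               MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

    # Keep only the pairs where the substring test succeeded
    print(pairs[pairs$similarity, ])
    ```

    With the three example names, only the pair (A&A PRECISION, A&A PRECISION ENGINEERING) passes the test. This is a sketch of the idea rather than a replacement for the data.table code, and it scales quadratically with the number of names.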
    

    Please keep in mind that the strings need to be as clean as possible for this to work, and even then it might fail.

    Some things I would check when cleaning your strings:

    • Remove multiple spaces
    • Remove spaces at the beginning / end of the string
    • Ensure the same encoding
    • Ensure the same case

    You can achieve that with the stringi package.
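
    The checklist above could look roughly like this. This is a minimal sketch in base R so that it runs without extra packages; clean_names is a hypothetical helper name, not part of any library. The stringi package offers equivalents for each step (for example stri_enc_toutf8, stri_replace_all_regex, stri_trim_both and stri_trans_toupper), which may handle non-ASCII text more robustly.

    ```r
    # Hypothetical helper: normalise a character vector of company names before matching
    clean_names <- function(s) {
      s <- enc2utf8(s)                    # same encoding (UTF-8)
      s <- gsub("[[:space:]]+", " ", s)   # collapse runs of whitespace into one space
      s <- trimws(s)                      # drop leading and trailing whitespace
      toupper(s)                          # same case
    }

    cleaned <- clean_names(c("  a&a   Precision ", "A&A PRECISION"))
    print(cleaned)
    ```

    After cleaning, both inputs become the same string, so the substring test is no longer thrown off by spacing or case differences.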