
Setting weightages for Jarowinkler in compare.linkage


I'm using the compare.linkage function from the RecordLinkage package in R to compare the similarity of two sets of strings. The default string comparison function is jarowinkler, with its three weights each defaulting to 1/3.

I want to override the default weights with, say, 4/9, 4/9 and 1/9. How do I do that? Thanks in advance.

The default script is:

rpairs <- compare.linkage(StringSet1, StringSet2, strcmp = TRUE, strcmpfun = jarowinkler)

Solution

  • You have to create your own comparison function, which compares two strings. In that function you can call jarowinkler. The easiest way to do this is to create a closure:

    jw <- function(W_1, W_2, W_3) {
      function(str1, str2) {
        jarowinkler(str1, str2, W_1, W_2, W_3)
      }
    }
    

    jw takes the weights you want to use and returns a comparison function, which you can then pass to compare.linkage:

    rpairs <- compare.linkage(StringSet1, StringSet2,
      strcmp = TRUE, strcmpfun = jw(4/9, 4/9, 1/9))
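
    To see what the closure holds before handing it to compare.linkage, here is a minimal sketch. It assumes the RecordLinkage package is installed for the final call; the force() calls and the name cmp are additions for illustration, not part of the answer above.

```r
# Same closure as above, written with an explicit namespace so that
# defining it does not require the package to be attached.
jw <- function(W_1, W_2, W_3) {
  # force() evaluates the arguments now, so the returned function
  # keeps exactly these weights (optional, but a common safeguard)
  force(W_1); force(W_2); force(W_3)
  function(str1, str2) {
    RecordLinkage::jarowinkler(str1, str2, W_1, W_2, W_3)
  }
}

cmp <- jw(4/9, 4/9, 1/9)

# The weights are stored in the environment of the returned function
weights <- mget(c("W_1", "W_2", "W_3"), envir = environment(cmp))
print(unlist(weights))

# Only call the comparison if RecordLinkage is actually available
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  print(cmp("john", "jonh"))
}
```

    Because cmp takes just two arguments, it has the shape compare.linkage expects for strcmpfun.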
    

    The Jaro-Winkler algorithm first counts the number of matching characters m, where two characters only count as a match if they lie within a certain bandwidth of each other. For the two strings john and jonh there are 4 characters that match (j, o, h and n). Taking only the matched characters from each string:

    john
    jonh
    

    It then counts the number of transpositions t. In this case there is one transposition (the h and n are switched).

    The Jaro similarity is given by:

    w1 * m/l1 + w2 * m/l2 + w3 * (m-t)/m
    

    where l1 and l2 are the lengths of the two strings. With all weights equal to 1/3 this results in a score between 0 and 1 (1 = perfect match). The same holds for any non-negative weights that sum to 1, such as 4/9, 4/9 and 1/9.
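
    As a rough check of the arithmetic, the weighted sum w1 * m/l1 + w2 * m/l2 + w3 * (m-t)/m can be sketched in base R. weighted_jaro is a hypothetical helper written for this illustration, not the RecordLinkage implementation. It assumes the commonly used bandwidth of floor(max(l1, l2)/2) - 1 and leaves out the Winkler prefix bonus, so its values will generally differ from what jarowinkler returns.

```r
# Sketch of the weighted Jaro similarity (no Winkler prefix bonus).
# Hypothetical helper for illustration only.
weighted_jaro <- function(s1, s2, w1 = 1/3, w2 = 1/3, w3 = 1/3) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  l1 <- length(a)
  l2 <- length(b)
  if (l1 == 0 || l2 == 0) return(0)

  # Assumed bandwidth: characters match only if at most `window` apart
  window <- max(floor(max(l1, l2) / 2) - 1, 0)
  used_a <- logical(l1)
  used_b <- logical(l2)
  for (i in seq_len(l1)) {
    lo <- max(1, i - window)
    hi <- min(l2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!used_b[j] && a[i] == b[j]) {
        used_a[i] <- TRUE
        used_b[j] <- TRUE
        break
      }
    }
  }

  m <- sum(used_a)
  if (m == 0) return(0)

  # Transpositions: half the matched characters that are out of order
  n_transpositions <- sum(a[used_a] != b[used_b]) / 2

  w1 * m / l1 + w2 * m / l2 + w3 * (m - n_transpositions) / m
}

# john vs jonh: m = 4, t = 1, l1 = l2 = 4
print(weighted_jaro("john", "jonh"))                 # 11/12
print(weighted_jaro("john", "jonh", 4/9, 4/9, 1/9))  # 35/36
```

    With the 4/9, 4/9, 1/9 weights the transposition term counts for less, so the swapped h and n are penalised less than under equal weights.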

    The Jaro-Winkler measure adds a 'bonus' for characters that match at the beginning of the string, as there are usually fewer errors at the beginning (the measure was created for names). For more information see, for example, M.P.J. van der Loo (2014), The stringdist Package for Approximate String Matching.