r fuzzy-comparison stringdist record-linkage

Jaro-Winkler distance differs between packages


I am using fuzzy matching to clean up user-entered medication data, with the Jaro-Winkler distance as the metric. While testing which package's Jaro-Winkler implementation was faster, I noticed that the default settings do not give identical values. Can anyone help me understand where the difference comes from? Example:

library(RecordLinkage)
library(stringdist)

jarowinkler("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), method = "jw")
# [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667

I assume it has to do with the weights, although I am using the defaults in both packages. If someone with more experience could shed light on what's going on, I would really appreciate it. Thanks!

Documentation:

https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf


Solution

  • Tucked away in the documentation for stringdist is the following:

    The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d − l · p · d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a penalty factor, which in the work of Winkler is often chosen 0.1.

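    However, in stringdist::stringdist, p = 0 by default, so the correction term vanishes and the result is the plain Jaro distance. Setting p = 0.1 reproduces the RecordLinkage values: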

    1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), 
                   method = "jw", p = .1)
    # [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
    

    In fact, the value p = 0.1 is hard-coded in the source of RecordLinkage::jarowinkler.
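
As a rough check of the formula quoted above, the correction can be applied by hand to the numbers from the question. This is only a sketch in base R: `prefix_len` is a hypothetical helper written for this illustration (it is not part of either package), and the Jaro similarities are copied from the printed `stringdist` output rather than recomputed.

```r
# Jaro similarities printed by stringdist with its default p = 0 (copied from the question).
targets  <- c("advi", "advill", "advil", "dvil", "sdvil")
jaro_sim <- c(0.9333333, 0.9444444, 1.0000000, 0.9333333, 0.8666667)

# Hypothetical helper: length of the common prefix of a and b, capped at max_len.
prefix_len <- function(a, b, max_len = 4) {
  n <- min(nchar(a), nchar(b), max_len)
  if (n == 0) return(0)
  same <- substring(a, 1:n, 1:n) == substring(b, 1:n, 1:n)
  sum(cumprod(same))  # counts matching characters up to the first mismatch
}

l <- vapply(targets, function(b) prefix_len("advil", b), numeric(1), USE.NAMES = FALSE)
p <- 0.1                  # the value the answer says RecordLinkage hard-codes

d      <- 1 - jaro_sim    # Jaro distance
jw     <- d - l * p * d   # Jaro-Winkler distance, as defined in the quoted documentation
jw_sim <- 1 - jw          # back to a similarity

# Values printed by RecordLinkage::jarowinkler in the question.
reported <- c(0.9600000, 0.9666667, 1.0000000, 0.9333333, 0.8666667)
isTRUE(all.equal(jw_sim, reported, tolerance = 1e-6))
```

Only the first three strings share a prefix with "advil", so only their scores move. "dvil" and "sdvil" differ at the first character, giving l = 0, which is why both packages already agreed on them.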