I am using fuzzy matching to clean up medication data input by users, and I am using Jaro-Winkler's distance. I was testing which package with Jaro-Winkler's distance was faster when I noticed the default settings do not give identical values. Can anyone help me understand where the difference comes from? Example:
library(RecordLinkage)
library(stringdist)
jarowinkler("advil", c("advi", "advill", "advil", "dvil", "sdvil"))
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
1- stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), method = "jw")
# [1] 0.9333333 0.9444444 1.0000000 0.9333333 0.8666667
I am assuming it has to do with the weights, and I know I am using the defaults on both. However, if someone with more experience could shed light on what's going on, I would really appreciate it. Thanks!
Documentation:
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf https://cran.r-project.org/web/packages/RecordLinkage/RecordLinkage.pdf
Tucked away in the documentation for stringdist
is the following:
The Jaro-Winkler distance (
method=jw
,0<p<=0.25
) adds a correction term to the Jaro-distance. It is defined asd − l · p · d
, whered
is the Jaro-distance. Here,l
is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factorp
is a penalty factor, which in the work of Winkler is often chosen 0.1.
However, in stringdist::stringdist
, p = 0
by default. Hence:
1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"),
method = "jw", p = .1)
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667
In fact that value is hard-coded in the source of RecordLinkage::jarowinkler
.