I'm using the compare.linkage function from the RecordLinkage package in R to compare the similarity of two sets of strings. The default string comparison method is jarowinkler, with its three weights set to 1/3, 1/3 and 1/3 by default.
I want to override the default weights with, say, 4/9, 4/9 and 1/9. How do I do that? Thanks in advance.
The default script is:
rpairs <- compare.linkage(StringSet1, StringSet2, strcmp = TRUE, strcmpfun = jarowinkler)
You have to create your own comparison function that compares two strings. In that function you can call `jarowinkler`. The easiest way to do this is to create a closure:
jw <- function(W_1, W_2, W_3) {
  function(str1, str2) {
    jarowinkler(str1, str2, W_1, W_2, W_3)
  }
}
This is a function to which you pass the weights you want to use. It returns a comparison function that you can use in your `compare.linkage` call:
rpairs <- compare.linkage(StringSet1, StringSet2,
                          strcmp = TRUE, strcmpfun = jw(4/9, 4/9, 1/9))
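To see that the closure really captures the weights, here is a minimal base R sketch of the same pattern. So that it runs without RecordLinkage installed, it uses `toy_scorer`, a hypothetical stand-in that is not the Jaro-Winkler metric, and `make_cmp`, a hypothetical name playing the role of `jw` above.

```r
# Hypothetical stand-in for jarowinkler(): NOT the real metric, just three
# simple vectorised checks so the example runs without RecordLinkage.
toy_scorer <- function(str1, str2, W_1, W_2, W_3) {
  W_1 * (nchar(str1) == nchar(str2)) +
    W_2 * (substr(str1, 1, 1) == substr(str2, 1, 1)) +
    W_3 * (str1 == str2)
}

# Same pattern as jw(): take the weights, return a two-argument function.
make_cmp <- function(W_1, W_2, W_3) {
  force(W_1); force(W_2); force(W_3)  # evaluate the weights right away
  function(str1, str2) toy_scorer(str1, str2, W_1, W_2, W_3)
}

cmp <- make_cmp(4/9, 4/9, 1/9)

# The returned function only takes the two string vectors; the weights
# travel with it in its enclosing environment.
arg_names <- names(formals(cmp))
scores <- cmp(c("john", "anna"), c("john", "bob"))
# First pair passes all three checks, second pair passes none.
```

The point of the design is that the returned function has the two-argument shape of a string comparator, while the weights stay fixed inside it.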
The Jaro-Winkler algorithm counts the number of characters that match (within a certain bandwidth), `m`. For the two strings `john` and `jonha` there are 4 characters that match (`j`, `o`, `h` and `n`). Taking only the selected characters:
john
jonh
It then counts the number of transpositions, `t`. In this case there is one transposition (the `h` and `n` are switched).
The Jaro similarity is given by:
w1 * m/l1 + w2 * m/l2 + w3 * (m-t)/m
with `l1` and `l2` the lengths of the two strings. With all weights equal to 1/3 this results in a score between 0 and 1 (1 = perfect match). The same holds for any non-negative weights that sum to 1, such as 4/9, 4/9 and 1/9.
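As a quick check of the arithmetic, the sketch below plugs the counts from the example (m = 4, t = 1) and string lengths 4 and 5 into the weighted sum. It assumes the form w1 * m/l1 + w2 * m/l2 + w3 * (m-t)/m with no extra leading factor, which is what makes equal weights of 1/3 give a score between 0 and 1. `jaro_weighted` is a hypothetical helper, not part of RecordLinkage, and it leaves out the Winkler prefix bonus.

```r
# Hypothetical helper: weighted Jaro score from pre-computed counts.
#   m      number of matching characters
#   t      number of transpositions
#   l1, l2 lengths of the two strings
jaro_weighted <- function(m, t, l1, l2, w1 = 1/3, w2 = 1/3, w3 = 1/3) {
  if (m == 0) return(0)  # no matches: avoid dividing by zero
  w1 * m / l1 + w2 * m / l2 + w3 * (m - t) / m
}

# Default weights: (4/4 + 4/5 + 3/4) / 3 = 0.85
default_score <- jaro_weighted(m = 4, t = 1, l1 = 4, l2 = 5)

# Custom weights 4/9, 4/9, 1/9: (4 + 3.2 + 0.75) / 9, about 0.883
custom_score <- jaro_weighted(m = 4, t = 1, l1 = 4, l2 = 5,
                              w1 = 4/9, w2 = 4/9, w3 = 1/9)
```

With the custom weights, the single transposition costs less than it does under the defaults, because the third term carries only 1/9 of the total.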
The Jaro-Winkler measure adds a 'bonus' for characters that match at the beginning of the string, as there are usually fewer errors at the beginning (the measure was created for names). For more information see, for example, M.P.J. van der Loo (2014), The stringdist Package for Approximate String Matching.