I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q.
So for these two strings, the qgrams function is correct:
> qgrams("a", "the cat sat on the mat", q = 2)
   th he t  sa on n  ma e   c ca at  s  t  o  m
V1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
V2  2  2  2  1  1  1  1  2  1  1  3  1  1  1  1
The stringdist function returns:
> stringdist("a", "the cat sat on the mat", q = 2, method = "qgram")
[1] Inf
Instead of returning:
> sum(qgrams("a", "the cat sat on the mat", q = 2)[2,])
[1] 21
Did I miss something or is this a bug? Thanks.
stringdist versions: 0.9.4.1 and 0.9.4.2
Currently stringdist::stringdist assumes an undefined (Inf) distance when q is larger than the string length.
My reasoning at the time was probably that the map from {the set of all strings over an alphabet Sigma} to {nonnegative integer vectors of length |Sigma|^q} has no explicit definition if q is greater than the input string length. This is also how I wrote it down in the stringdist paper.
qgrams maps such cases to the 0-vector, which is indeed inconsistent.
If I take the definition in the paper of Ukkonen (1992), mapping to the 0-vector is indeed the right choice, implying a bug in stringdist.
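For illustration, here is a minimal base R sketch of the 0-vector convention: a string shorter than q simply contributes no q-grams, so its profile is all zeros. The helper names qgrams_of and qgram_dist are hypothetical and not part of stringdist, and this is not how the package computes the distance internally.

```r
# Hypothetical helper (not part of stringdist): all q-grams of one string.
qgrams_of <- function(s, q) {
  n <- nchar(s)
  # Assumption: a string shorter than q has no q-grams (the 0-vector case).
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# Hypothetical helper: sum of absolute differences between q-gram counts.
qgram_dist <- function(a, b, q) {
  ga <- qgrams_of(a, q)
  gb <- qgrams_of(b, q)
  grams <- union(ga, gb)
  # Count how often each distinct q-gram occurs in each string;
  # q-grams absent from a string get a count of 0.
  ca <- tabulate(match(ga, grams), nbins = length(grams))
  cb <- tabulate(match(gb, grams), nbins = length(grams))
  sum(abs(ca - cb))
}

qgram_dist("a", "the cat sat on the mat", q = 2)  # 21
```

Under this convention the result matches sum(qgrams(...)[2,]) from the report above, since the longer string has 21 overlapping 2-grams and the shorter one has none.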
Will fix.