
Why does R stringdist return Inf in q-gram distance with one string shorter than q?


I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q.

For these two strings, the qgrams function gives the correct result:

> qgrams("a", "the cat sat on the mat", q = 2)
   th he t  sa on n  ma e   c ca at  s  t  o  m
V1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
V2  2  2  2  1  1  1  1  2  1  1  3  1  1  1  1

The stringdist function returns:

> stringdist("a", "the cat sat on the mat", q = 2, method = "qgram")
[1] Inf

Instead of returning:

> sum(qgrams("a", "the cat sat on the mat", q = 2)[2,])
[1] 21

Did I miss something or is this a bug? Thanks.

stringdist versions: 0.9.4.1 and 0.9.4.2


Solution

  • Currently stringdist::stringdist assumes an undefined (Inf) distance when q is larger than the string length.

    My reasoning at the time was probably that the map from {the set of all strings over an alphabet Sigma} to {nonnegative integer vectors of length |Sigma|^q} has no explicit definition if q is greater than the input string length. This is also how I wrote it down in the stringdist paper.

    qgrams maps such cases to the 0-vector, which is indeed inconsistent.

    Going by the definition in the paper of Ukkonen (1992), mapping to the 0-vector is indeed the right choice, which implies a bug in stringdist.

    Will fix.
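
For illustration, here is a minimal base-R sketch of the q-gram distance under the 0-vector convention described above. It is not the stringdist implementation; the helper names qgram_list and qgram_dist are made up for this example, and it assumes plain single-string inputs.

```r
# Hypothetical helper: all q-grams of one string, in order of appearance.
# A string shorter than q yields no q-grams, i.e. the 0-vector.
qgram_list <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# Hypothetical helper: sum of absolute differences between the
# q-gram count vectors of a and b.
qgram_dist <- function(a, b, q) {
  ga <- qgram_list(a, q)
  gb <- qgram_list(b, q)
  grams <- union(ga, gb)
  # Count how often each distinct q-gram occurs in a list of q-grams.
  count <- function(g) vapply(grams, function(x) sum(g == x), integer(1))
  sum(abs(count(ga) - count(gb)))
}

qgram_dist("a", "the cat sat on the mat", q = 2)
```

With this convention the call above gives 21, the same value as sum(qgrams(...)[2,]) in the question, because every 2-gram of the longer string is compared against a count of zero.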