EDIT: This bug was found in 32-bit versions of R and was fixed in R version 2.9.2.
This was tweeted to me by @leoniedu today and I don't have an answer for him so I thought I would post it here.
I have read the documentation for agrep() (fuzzy string matching) and it appears that I don't fully understand the max.distance parameter. Here's an example:
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"
That behaves exactly as I would expect. The strings differ by 18 characters, so I would expect that to be the threshold for a match. Here's what's confusing me:
Why are 30 and 33 matches, but not 31 and 32? To save you some counting,
> nchar("Staatssekretar im Bundeskanzleramt")
[1] 34
> nchar("Bundeskanzleramt")
[1] 16
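As a rough cross-check of the 18-character figure, the edit distance between the two strings can be computed directly. This is only a sketch: it assumes a more recent R than the 2.9.x discussed here, since `adist()` is not available in that version, and the variable names `len_diff` and `edit_dist` are just for illustration.

```r
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# Difference in length between the two strings
len_diff <- nchar(pattern) - nchar(x)

# Generalized Levenshtein distance (insertions, deletions, substitutions);
# adist() returns a 1 x 1 matrix here, so flatten it to a plain number
edit_dist <- as.vector(adist(pattern, x))

print(len_diff)
print(edit_dist)
```

Both values should come out as 18, because deleting the prefix "Staatssekretar im " (18 characters including the trailing space) turns the pattern into x.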
I posted this on the R list a while back and reported it as a bug on the R-bugs list. I got no useful responses, so I tweeted to see whether the bug was reproducible or I was just missing something. JD Long was able to reproduce it and kindly posted the question here.
Note that, at least in R, agrep is then a misnomer since it does not match regular expressions, whereas grep stands for "Globally search for the Regular Expression and Print". It shouldn't have a problem with patterns longer than the target vector (I think!).
On my Linux server all is well, but not on my Mac and Windows machines.
Mac:
> sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin8.11.1
locale: en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> agrep(pattern,x,max.distance=30)
[1] 1
> agrep(pattern,x,max.distance=31)
integer(0)
> agrep(pattern,x,max.distance=32)
integer(0)
> agrep(pattern,x,max.distance=33)
[1] 1
Linux:
R version 2.9.1 (2009-06-26)
x86_64-unknown-linux-gnu
> agrep(pattern,x,max.distance=30)
[1] 1
> agrep(pattern,x,max.distance=31)
[1] 1
> agrep(pattern,x,max.distance=32)
[1] 1
> agrep(pattern,x,max.distance=33)
[1] 1
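For anyone who wants to check their own machine, the comparison above can be run in one go. This is a minimal sketch rather than an official test; the names `distances` and `matched` are only illustrative, and the result will depend on your R version and platform.

```r
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# Try each max.distance value and record whether agrep() reports a match
distances <- 30:33
matched <- sapply(distances, function(d) {
  length(agrep(pattern, x, max.distance = d)) > 0
})
names(matched) <- distances

# Report the R version and platform alongside the results
cat(R.version.string, "on", R.version$platform, "\n")
print(matched)
```

On an unaffected build, every entry should be TRUE, as in the Linux output above; a FALSE for 31 or 32 would point to the behaviour seen on the Mac.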