
How to match a string with a tolerance of one character?


I have a vector of locations that I am trying to disambiguate against a vector of correct location names. For this example I am using just two disambiguated locations, though:

agrepl('Au', c("Austin, TX", "Houston, TX"), 
max.distance =  .000000001, 
ignore.case = T, fixed = T)
[1] TRUE TRUE

The help page says that max.distance is

Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost

I am not sure about the mathematical meaning of the Levenshtein distance; my understanding is that the smaller the distance, the stricter the tolerance for mismatches with my vector of disambiguated strings.

So how would I adjust it to get two FALSE values? Basically I would like to get TRUE only when there is a difference of 1 character, as in:

agrepl('Austn, TX', "Austin, TX", 
max.distance =  .000000001, ignore.case = T, fixed = T)
[1] TRUE

Solution

  • The problem you are having is possibly similar to the one I faced when I started experimenting here. The first argument is a regex pattern only when fixed=FALSE, and in either case agrepl matches substrings, so short patterns are very permissive unless they are constrained to match the full string. The help page even has a "Note" about that issue:

    Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements.

    With regex patterns (fixed=FALSE) you can do this by flanking the pattern string with "^" and "$", since, unlike adist, agrepl has no partial parameter:

    > agrepl('^Au$', "Austin, TX", 
    + max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
    [1] FALSE
    > agrepl('^Austn, TX$', "Austin, TX", 
    + max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
    [1] TRUE
    > agrepl('^Austn, T$', "Austin, TX", 
    + max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
    [1] FALSE
    

    So you need to use paste0 to add those flankers:

    > agrepl( paste0('^', 'Austn, Tx', '$'), "Austin, TX", 
    + max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
    [1] TRUE
    > agrepl( paste0('^', 'Au', '$'), "Austin, TX", 
    + max.distance =  c(insertions=.15),  ignore.case = T, fixed=FALSE)
    [1] FALSE
    

    It might be better to use all rather than just insertions, and you may want to lower the fraction.
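As a complement to the anchored-regex approach, here is a minimal sketch of two related ideas. The names `edit_budget` and `within_edits` are hypothetical helpers, not part of base R. The first only illustrates the rule quoted from the help page for a scalar `max.distance`, assuming a maximal transformation cost of 1 and that the fraction is rounded up to the next integer, as the help page states. It is not a reproduction of agrepl's internals. The second swaps in `adist`, which compares whole strings by default, so no "^" and "$" flankers are needed.

```r
# Hypothetical helper: the integer edit budget implied by a fractional
# max.distance, assuming the documented rule
#   ceiling(fraction * pattern length * maximal transformation cost)
# with a maximal cost of 1 (an assumption for this sketch).
edit_budget <- function(pattern, fraction, max_cost = 1) {
  ceiling(fraction * nchar(pattern) * max_cost)
}

# Hypothetical helper: TRUE when `candidate` is within `tol` edits
# (insertions, deletions, substitutions) of the WHOLE reference string.
# adist() is in the utils package, which is attached by default.
within_edits <- function(candidate, reference, tol = 1) {
  d <- adist(candidate, reference, ignore.case = TRUE)
  as.vector(d <= tol)
}

refs <- c("Austin, TX", "Houston, TX")

edit_budget("Au", 1e-9)          # 1: any positive fraction still allows one edit
edit_budget("Austn, TX", 0.15)   # 2: ceiling(0.15 * 9)

within_edits("Austn, TX", refs)  # TRUE FALSE
within_edits("Au", refs)         # FALSE FALSE
```

Under these assumptions, a very small fraction such as .000000001 still rounds up to a budget of one edit, so lowering the fraction further does not make the match stricter. Comparing whole strings with `adist` and an explicit integer tolerance avoids that rounding entirely.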