I have a vector of locations that I am trying to disambiguate against a vector of correct location names. For this example I am using just two disambiguated locations, though:
agrepl('Au', c("Austin, TX", "Houston, TX"),
max.distance = .000000001,
ignore.case = T, fixed = T)
[1] TRUE TRUE
The help page says that max.distance is:
Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost
I am not sure about the mathematical meaning of the Levenshtein distance; my understanding is that the smaller the distance, the stricter the tolerance for mismatches with my vector of disambiguated strings.
So how would I adjust it to retrieve two FALSE? Basically I would like to have a TRUE only when there is a difference of 1 character, like in:
agrepl('Austn, TX', "Austin, TX",
max.distance = .000000001, ignore.case = T, fixed = T)
[1] TRUE
The problem you are having is possibly similar to the one I faced when I started experimenting here. The first argument is a regex pattern when fixed=FALSE, and agrepl matches substrings, so short patterns are very permissive unless they are constrained to match the full string. The help page even has a "Note" about that issue:
Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements.
Using regex patterns you do this by flanking the pattern string with "^" and "$", since unlike adist, agrepl has no partial parameter:
> agrepl('^Au$', "Austin, TX",
+ max.distance = c(insertions=.15), ignore.case = T, fixed=FALSE)
[1] FALSE
> agrepl('^Austn, TX$', "Austin, TX",
+ max.distance = c(insertions=.15), ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl('^Austn, T$', "Austin, TX",
+ max.distance = c(insertions=.15), ignore.case = T, fixed=FALSE)
[1] FALSE
So you need to use paste0 to add those flankers:
> agrepl( paste0('^', 'Austn, Tx', '$'), "Austin, TX",
+ max.distance = c(insertions=.15), ignore.case = T, fixed=FALSE)
[1] TRUE
> agrepl( paste0('^', 'Au', '$'), "Austin, TX",
+ max.distance = c(insertions=.15), ignore.case = T, fixed=FALSE)
[1] FALSE
It might be better to use all rather than just insertions, and you may want to lower the fraction.
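If the goal is literally "TRUE only when the whole string differs by at most one character", another option is to skip agrepl and compare whole-string edit distances with adist, which was mentioned above. The following is only a minimal sketch: within_edits and its max_edits argument are hypothetical names made up for this illustration, not part of base R, and it assumes a single candidate string and case-insensitive comparison.

```r
# Hypothetical helper (not part of base R): for ONE candidate string, report
# which reference strings are within `max_edits` single-character edits
# (insertions, deletions, substitutions) of the whole candidate.
within_edits <- function(candidate, references, max_edits = 1L) {
  stopifnot(length(candidate) == 1L)
  # utils::adist() returns a matrix of generalized Levenshtein distances,
  # one row per element of `candidate`, one column per element of `references`.
  d <- utils::adist(candidate, references, ignore.case = TRUE)
  unname(d[1, ] <= max_edits)
}

refs <- c("Austin, TX", "Houston, TX")

within_edits("Au", refs)         # FALSE FALSE: far more than one edit away
within_edits("Austn, TX", refs)  # TRUE FALSE: one insertion away from "Austin, TX"
```

Because adist compares complete strings by default, no "^" and "$" flankers are needed, and the threshold is a plain edit count rather than a fraction of the pattern length.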