I'm looking for an efficient way in R to tell if one string might be an abbreviation for another. The basic approach I'm taking is to see if the letters in the shorter string appear in the same order in the longer string. For example, if my shorter string were "abv" and my longer string were "abbreviation", I would want a positive result, whereas if my shorter string were "avb", I would want a negative result. I have a function I've put together that works, but it seems like a pretty inelegant solution, and I thought I might be missing some regex magic. I've also looked at R's 'stringdist' function, but I haven't found anything that seems like it does this particularly. Here's my function:
# This function computes whether one of the input strings (input strings x and y) could be an abbreviation of the other
# input strings should be all the same case, and probably devoid of anything but letters and numbers
abbrevFind = function(x, y) {
# Compute the number of characters in each string
len.x = nchar(x)
len.y = nchar(y)
# Find out which string is shorter, and therefore a possible abbreviation
# split each string into its component characters
if (len.x < len.y) {
# Designate the abbreviation and the full string
abv = substring(x, 1:len.x, 1:len.x)
full = substring(y, 1:len.y, 1:len.y)
} else if (len.x >= len.y) {
abv = substring(y, 1:len.y, 1:len.y)
full = substring(x, 1:len.x, 1:len.x)
}
# Get the number of letters in the abbreviation
small = length(abv)
# Set up old position, which will be a comparison criteria
pos.old = 0
# set up an empty vector which will hold the letter positions of already used letters
letters = c()
# Loop through each letter in the abbreviation
for (i in 1:small) {
# Get the position in the full string of the ith letter in the abbreviation
pos = grep(abv[i], full)
# Exclude positions which have already been used
pos = pos[!pos %in% letters]
# Get the earliest position (note that if the grep found no matches, the min function will return 'Inf' here)
pos = min(pos)
# Store that position
letters[i] = pos
# If there are no matches to the current letter, or the current letter's only match is earlier in the string than the last match
# it is not a possible abbreviation. The loop breaks, and the function returns False
# If the function makes it all the way through without breaking out of the loop, the function will return true
if (is.infinite(pos) | pos <= pos.old) {abbreviation = F; break} else {abbreviation = T}
# Set old position equal to the current position
pos.old = pos
}
return(abbreviation)
}
Thanks for any help!
what about something like this where you basically take each character and add an option to match any letter 0 or more times between each ([a-z]*?
)
f <- Vectorize(function(x, y) {
xx <- strsplit(tolower(x), '')[[1]]
grepl(paste0(xx, collapse = '[a-z]*?'), y)
## add this if you only want to consider letters in y
# grepl(paste0(xx, collapse = sprintf('[%s]*?', tolower(y))), y)
}, vectorize.args = 'x')
f(c('ohb','hello','ob','ohc'), 'ohbother')
# ohb hello ob ohc
# TRUE FALSE TRUE FALSE
f(c('abbrev','abb','abv', 'avb'), 'abbreviation')
# abbrev abb abv avb
# TRUE TRUE TRUE FALSE