Search code examples
regexrstringstringdist

R look for abbreviation in full string


I'm looking for an efficient way in R to tell if one string might be an abbreviation for another. The basic approach I'm taking is to see if the letters in the shorter string appear in the same order in the longer string. For example, if my shorter string were "abv" and my longer string were "abbreviation", I would want a positive result, whereas if my shorter string were "avb", I would want a negative result. I have a function I've put together that works, but it seems like a pretty inelegant solution, and I thought I might be missing some regex magic. I've also looked at R's 'stringdist' function, but I haven't found anything that seems like it does this particularly. Here's my function:

# This function computes whether one of the input strings (input strings x and y) could be an abbreviation of the other
# input strings should be all the same case, and probably devoid of anything but letters and numbers
abbrevFind = function(x, y) {

  # Compute the number of characters in each string
  len.x = nchar(x)
  len.y = nchar(y)

  # Find out which string is shorter, and therefore a possible abbreviation
  # split each string into its component characters
  if (len.x < len.y) {

    # Designate the abbreviation and the full string
    abv = substring(x, 1:len.x, 1:len.x)
    full = substring(y, 1:len.y, 1:len.y)

  } else if (len.x >= len.y) {

    abv = substring(y, 1:len.y, 1:len.y)
    full = substring(x, 1:len.x, 1:len.x)

  }

  # Get the number of letters in the abbreviation
  small = length(abv)

  # Set up old position, which will be a comparison criteria
  pos.old = 0

  # set up an empty vector which will hold the letter positions of already used letters
  letters = c()

  # Loop through each letter in the abbreviation
  for (i in 1:small) {

    # Get the position in the full string of the ith letter in the abbreviation
    pos = grep(abv[i], full)
    # Exclude positions which have already been used
    pos = pos[!pos %in% letters]
    # Get the earliest position (note that if the grep found no matches, the min function will return 'Inf' here)
    pos = min(pos)
    # Store that position
    letters[i] = pos

    # If there are no matches to the current letter, or the current letter's only match is earlier in the string than the last match
    # it is not a possible abbreviation. The loop breaks, and the function returns False
    # If the function makes it all the way through without breaking out of the loop, the function will return true
    if (is.infinite(pos) | pos <= pos.old) {abbreviation = F; break} else {abbreviation = T}

    # Set old position equal to the current position
    pos.old = pos

  }

  return(abbreviation)

}

Thanks for any help!


Solution

  • what about something like this where you basically take each character and add an option to match any letter 0 or more times between each ([a-z]*?)

    f <- Vectorize(function(x, y) {
      xx <- strsplit(tolower(x), '')[[1]]
      grepl(paste0(xx, collapse = '[a-z]*?'), y)
      ## add this if you only want to consider letters in y
      # grepl(paste0(xx, collapse = sprintf('[%s]*?', tolower(y))), y)
    }, vectorize.args = 'x')
    
    f(c('ohb','hello','ob','ohc'), 'ohbother')
    #  ohb hello    ob   ohc 
    # TRUE FALSE  TRUE FALSE 
    
    f(c('abbrev','abb','abv', 'avb'), 'abbreviation')
    # abbrev    abb    abv    avb 
    #   TRUE   TRUE   TRUE  FALSE