python  python-3.x  regex  string  data-extraction

remove r n r n from string


I want to remove the extra r and n characters from this string. I tried a regex, but I am not sure whether regex or some other method is the right tool here.

This is the code I am trying to use:

import re

text = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"

regex_pattern = re.compile(r'\s[rn]\s')
matches = regex_pattern.findall(text)
for match in matches:
    text = text.replace(match," ")
print(text)

Current Output:

r nFamily Medical History new   Roger nRobert nDawson n49 nyears old , right shoulder 

We still see many r n. I am also wondering how to remove the 'n' from n49 and nyears, and how to remove the first 'n' from nDawson without removing the last 'n'.

Expected Output:

Family Medical History new Roger Robert Dawson 49 years old , right shoulder

Solution

  • I would suggest a bit of an NLP approach here as I do not see how regex can tell nyears (wrong spelling) from new (correct spelling).

    First, remove all standalone r / n and those glued to capitalized words and numbers. Then split the string and run each word that starts with n or r through a spellchecker. The leading n or r can be removed if word[1:] is spelled correctly and word is not. If neither is correct, I think it is safe to fall back to the original word.

    To run spellcheck, for example, you can use TextBlob.

    Here is a Python code demo:

    from textblob import Word
    import re

    s = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"
    # Remove standalone r/n and those glued to capitalized words or numbers
    s = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s)
    result = []
    for w in s.split():
        if not w.startswith(('n', 'r')):       # w does not start with n or r
            result.append(w)                   # Add it to the result
        elif Word(w).correct() == w:           # w is a correct word
            result.append(w)                   # Add it to the result
        elif Word(w[1:]).correct() == w[1:]:   # w is misspelled but w[1:] is correct
            result.append(w[1:])               # Add w[1:] to the result
        else:
            result.append(w)                   # Fallback: add w to the result
    print(" ".join(result))
    # => Family Medical History new Roger Robert Dawson 49 years old , right shoulder
    

    The re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s) part removes r and n at the start of a word when immediately followed by an uppercase letter, a digit, whitespace, or the end of the string.

    Then, for w in s.split(): iterates over the words in the sentence and replaces a word with w[1:] only if it starts with n or r, is misspelled, and w[1:] is spelled correctly.

    DISCLAIMER: TextBlob is used as an example. You are free to use any other spellchecking library. TextBlob spellchecking "is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate".
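
Since the spellchecker is interchangeable, the two steps can also be sketched against a generic "is this word spelled correctly?" predicate. The names STRAY, KNOWN_WORDS, is_known and clean below are hypothetical and not part of any library, and the tiny word set only stands in for a real spellchecker so the logic can be run without installing anything. With TextBlob, the predicate would presumably be something like lambda w: Word(w).correct() == w.

```python
import re

# Same cleanup pattern as above: a word-initial r/n followed by an
# uppercase letter, a digit, whitespace, or the end of the string.
STRAY = re.compile(r'\b[rn](?=[A-Z0-9\s]|$)')

# Hypothetical stand-in vocabulary; a real spellchecker would replace this.
KNOWN_WORDS = {"new", "years", "old", "right", "shoulder"}

def is_known(word):
    """Assumed spellcheck predicate: True if the word is spelled correctly."""
    return word.lower() in KNOWN_WORDS

def clean(text, is_correct=is_known):
    text = STRAY.sub('', text)           # step 1: regex cleanup
    result = []
    for w in text.split():               # step 2: per-word spellcheck
        if (w.startswith(('n', 'r'))
                and not is_correct(w)
                and is_correct(w[1:])):
            result.append(w[1:])         # drop the stray leading letter
        else:
            result.append(w)             # keep the word unchanged
    return " ".join(result)

s = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"
print(STRAY.sub('', s).split())  # intermediate result: 'nyears' is still there
print(clean(s))
```

After the regex step alone, nyears is the only damaged token left, which is why the spellcheck pass is needed. Passing a different is_correct callable to clean swaps the spellchecker without touching the loop.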