python python-3.x regex string data-extraction

remove r n r n from string

I want to remove extra r and n from this string. I tried regex. Not sure if regex or some other method would be helpful here.

This is the code I am trying to use:

import re

text = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"

regex_pattern = re.compile(r'\s[rn]\s')
matches = regex_pattern.findall(text)
for match in matches:
    text = text.replace(match," ")
print(text)

Current Output:

r nFamily Medical History new   Roger nRobert nDawson n49 nyears old , right shoulder

We still see several stray r and n characters. I am also wondering how to remove the 'n' from n49 and nyears, and the first 'n' from nDawson without removing the last 'n'.

Expected Output:

Family Medical History new Roger Robert Dawson 49 years old , right shoulder

Solution

I would suggest a bit of an NLP approach here as I do not see how regex can tell nyears (wrong spelling) from new (correct spelling).

First, remove all standalone r / n characters and those glued to capitalized words and numbers. Then split the string and run a spellchecker on each word that starts with n or r. The leading letter can be removed if word[1:] is spelled correctly and word is not. If neither is correct, I think it is safe to fall back to the original word.

To run spellcheck, for example, you can use TextBlob.

Here is a Python code demo:

from textblob import Word
import re

s = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"
# Remove standalone r / n and those glued to capitalized words and numbers
s = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s)
result = []
for w in s.split():
    if not w.startswith(('n', 'r')):      # w does not start with n or r...
        result.append(w)                  # Add it to the result
    elif Word(w).correct() == w:          # w is a correctly spelled word
        result.append(w)                  # Add it to the result
    elif Word(w[1:]).correct() == w[1:]:  # w is misspelled, but w[1:] is correct
        result.append(w[1:])              # Add w[1:] to the result
    else:
        result.append(w)                  # Fallback: add w to the result
print(" ".join(result))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder

The re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s) call removes r and n at the start of a word when immediately followed by an uppercase letter, a digit, whitespace, or the end of the string.
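To see what the regex step does on its own, here is a small sketch that uses only the standard library. The helper name strip_stray and the sample strings are made up for illustration. Note that nyears is left alone at this stage, because the n is followed by a lowercase letter.

```python
import re

# Same pattern as in the answer: a word-initial r or n followed by an
# uppercase letter, a digit, whitespace, or the end of the string.
STRAY = re.compile(r'\b[rn](?=[A-Z0-9\s]|$)')

def strip_stray(text):
    """Remove stray r/n characters, then collapse the leftover whitespace."""
    return " ".join(STRAY.sub('', text).split())

# Illustrative samples taken from pieces of the question's string
samples = ["r n nFamily", "n49 nyears", "Roger nRobert n nDawson", "new r"]
for sample in samples:
    print(repr(sample), "->", repr(strip_stray(sample)))
```

The \b before [rn] is what protects the final r in Roger and the final n in Dawson: there is no word boundary between two letters, so only a word-initial r or n can match.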

Then, the for w in s.split(): loop iterates over the words in the sentence and replaces a word with w[1:] only if it starts with n or r, is misspelled, and w[1:] is spelled correctly.

DISCLAIMER: TextBlob is used only as an example; you are free to use any other spellchecking library. TextBlob spellchecking "is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate."
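Since the spellchecker is interchangeable, the same logic can be written against a plain "is this word correct?" function. The following is only a sketch: KNOWN_WORDS, is_known and clean are hypothetical names, and the tiny hard-coded vocabulary stands in for a real spellchecker or word list.

```python
import re

# Hypothetical stand-in vocabulary; a real spellchecker or a full word list
# would replace it in practice.
KNOWN_WORDS = {"new", "years", "old", "right", "shoulder"}

def is_known(word):
    """Illustrative predicate: True if the word is in the tiny vocabulary."""
    return word.lower() in KNOWN_WORDS

def clean(text, is_correct=is_known):
    # Step 1: the regex from the answer removes standalone r / n and those
    # glued to capitalized words and numbers.
    text = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', text)
    result = []
    for w in text.split():
        # Step 2: drop the leading letter only if w is misspelled
        # and the remainder w[1:] is spelled correctly.
        if w.startswith(('n', 'r')) and not is_correct(w) and is_correct(w[1:]):
            w = w[1:]
        result.append(w)
    return " ".join(result)

s = "r n r n r nFamily Medical History new r n  r n r r r  Roger nRobert n nDawson n49 nyears old , right shoulder"
print(clean(s))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder
```

To plug in another checker, pass any callable that takes a word and returns True or False as is_correct. With TextBlob, for instance, that could be lambda w: Word(w).correct() == w.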