I want to remove the extra r and n characters from this string. I tried a regex, but I am not sure whether a regex or some other method is the right tool here.
This is the code I am trying to use:
import re
text = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"
regex_pattern = re.compile(r'\s[rn]\s')
matches = regex_pattern.findall(text)
for match in matches:
    text = text.replace(match, " ")
print(text)
Current Output:
r nFamily Medical History new Roger nRobert nDawson n49 nyears old , right shoulder
We still see many stray r and n characters. I am also wondering how to remove the n from n49 and nyears, and the leading n from nDawson without removing its final n.
Expected Output:
Family Medical History new Roger Robert Dawson 49 years old , right shoulder
I would suggest a bit of an NLP approach here, as I do not see how a regex can tell nyears (wrong spelling) from new (correct spelling).
First, remove all standalone r / n and those glued to capitalized words and numbers, then split the string and check each word that starts with n or r with a spellchecker. The leading letter can be removed if word[1:] is correct and word is not. If neither is correct, I think it is safe to fall back to word.
To run the spellcheck you can use, for example, TextBlob.
Here is a Python code demo:
from textblob import Word
import re

s = "r n r n r nFamily Medical History new r n r n r r r Roger nRobert n nDawson n49 nyears old , right shoulder"
# Drop r/n at a word start when followed by an uppercase letter, digit, whitespace or end of string
s = re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s)
result = []
for w in s.split():
    if not w.startswith(('n', 'r')):          # The w word does not start with n or r...
        result.append(w)                      # Add it to the result
    elif Word(w).correct() == w:              # If w is a correct word
        result.append(w)                      # Add it to the result
    elif Word(w[1:]).correct() == w[1:]:      # If w[1:] is correct
        result.append(w[1:])                  # Add w[1:] to the result
    else:
        result.append(w)                      # Fallback: add w to the result
print(" ".join(result))
# => Family Medical History new Roger Robert Dawson 49 years old , right shoulder
The re.sub(r'\b[rn](?=[A-Z0-9\s]|$)', '', s) part removes r and n at the start of a word when they are immediately followed by an uppercase letter, a digit, whitespace, or the end of the string.
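To see the regex step in isolation, here is a small sketch that uses only the standard library. It contrasts the question's pattern, which consumes the whitespace around each token, with the lookahead version, which only peeks at the next character. The short probe string " r n r " and the variable names are mine, chosen for illustration.

```python
import re

s = ("r n r n r nFamily Medical History new r n r n r r r Roger "
     "nRobert n nDawson n49 nyears old , right shoulder")

# The question's pattern consumes the space on both sides of a token,
# so the middle token of " r n r " is left without a leading space to match.
consuming = re.findall(r'\s[rn]\s', " r n r ")

# The lookahead checks the following character without consuming it,
# so adjacent stray tokens are all found in one pass.
stray = re.compile(r'\b[rn](?=[A-Z0-9\s]|$)')
lookahead = stray.findall(" r n r ")

# Apply the substitution, then collapse the leftover runs of spaces.
cleaned = " ".join(stray.sub('', s).split())

print(consuming)   # [' r ', ' r ']
print(lookahead)   # ['r', 'n', 'r']
print(cleaned)     # Family Medical History new Roger Robert Dawson 49 nyears old , right shoulder
```

Note that nyears survives this step: its n is followed by a lowercase letter, which is exactly the case the regex cannot decide and the spellcheck has to handle.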
Then, for w in s.split(): iterates over the words in the sentence and replaces a word with w[1:] only when the word starts with n or r, is itself misspelled, and w[1:] is spelled correctly.
DISCLAIMER: TextBlob is used as an example. You are free to use any other spellchecking library. TextBlob spellchecking "is based on Peter Norvig's 'How to Write a Spelling Corrector' as implemented in the pattern library. It is about 70% accurate".
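Since the spellchecker is interchangeable, the decision rule can be kept separate from it. The following is only a sketch: strip_stray_prefix and is_known are hypothetical names, and the tiny vocabulary set is a stand-in for whatever library you plug in (with TextBlob, is_known could wrap the Word(w).correct() == w check from the demo).

```python
from typing import Callable


def strip_stray_prefix(word: str, is_known: Callable[[str], bool]) -> str:
    """Drop a leading n/r only if that turns a misspelling into a known word."""
    if not word.startswith(('n', 'r')):
        return word                 # nothing to strip
    if is_known(word):
        return word                 # e.g. "new", "right": already correct
    rest = word[1:]
    if rest and is_known(rest):
        return rest                 # e.g. "nyears" -> "years"
    return word                     # neither form is known: keep the original


# Toy vocabulary standing in for a real spellchecker (assumption for the demo).
vocabulary = {"new", "years", "old", "right", "shoulder"}


def is_known(w: str) -> bool:
    return w.lower() in vocabulary


words = "new nyears old , right nxyz".split()
fixed = [strip_stray_prefix(w, is_known) for w in words]
print(" ".join(fixed))  # new years old , right nxyz
```

Keeping the rule behind a plain callable makes it easy to try a different checker, or a domain word list for names and medical terms, without touching the rest of the code.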