I am trying to use a regex function to remove a word, whitespace, special characters and standalone numbers, but not digits that are part of a word/string. E.g.
ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//
The \W+ removes all numbers, including the 1 in malwmrllp1.
import re
text_file = open('mytext.txt').read()
new_txt = re.sub('[\\b\\d+\\b\s*$+\sORIGIN$\W+]', '', text_file)
print(new_txt, len(new_txt))
My output is:
malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109
The desired output should be: malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
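Your current pattern is a single character class, so everything inside the brackets (\d, \s, \W, the letters of ORIGIN) is just an alternative for one individual character; that is why the digit in malwmrllp1 is removed as well. Depending on whether you want underscores in the result or not, try re.findall with raw-string notation instead: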
\b(?!(?:ORIGIN|[_\d]+)\b)\w+
\b - Word-boundary;
(?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with nested non-capture group to match either ORIGIN or 1+ underscore/digit combinations before a trailing word-boundary;
\w+ - 1+ word-characters.
import re
text_file = """ORIGIN
1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
//"""
new_txt = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
print(new_txt, len(new_txt))
Prints:
malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
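To see why the original attempt drops the digit, here is a minimal side-by-side sketch. It assumes a shortened sample (the names sample, stripped and kept are only for illustration) and uses the question's character class written as a raw string, which should compile to the same regex:

```python
import re

# Shortened, hypothetical sample in the same layout as the question's input.
sample = """ORIGIN
1 malwmrllp1 lallalwgpd
61 lqvgqvelgg
//"""

# The question's pattern as a raw string. Inside [...] each item is an
# alternative for ONE character, so \d matches any digit anywhere,
# including the digit inside "malwmrllp1".
char_class = re.compile(r'[\b\d+\b\s*$+\sORIGIN$\W+]')
stripped = char_class.sub('', sample)

# The token-based pattern from this answer: skip "ORIGIN" and tokens made up
# only of digits/underscores, keep every other run of word characters whole.
token_pattern = re.compile(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+')
kept = ''.join(token_pattern.findall(sample))

print(stripped, len(stripped))
print(kept, len(kept))
```

The first result loses the 1 in malwmrllp1 and is one character shorter; the second keeps it, because the lookahead only rejects tokens that consist entirely of digits or underscores.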