Remove all numbers except for the ones combined to string using python regex


I'm trying to use a regex to remove a word (ORIGIN), whitespace, special characters, and standalone numbers, but keep digits that are part of a word/string. E.g.:

ORIGIN
    1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
    61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

//

My pattern removes all digits, including the 1 in malwmrllp1:

import re

text_file = open('mytext.txt').read()
new_txt = re.sub('[\\b\\d+\\b\s*$+\sORIGIN$\W+]', '', text_file)

print(new_txt, len(new_txt))

My output is:

malwmrllplallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 109

The desired output should be: malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110


Solution

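  • Depending on whether underscores should appear in the result, try re.findall with raw-string notation. Your current pattern is a single character class, which matches one character at a time (any digit, whitespace, non-word character, or one of the letters O, R, I, G, N), so it strips every digit, including those inside words. Use this pattern instead: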


    \b(?!(?:ORIGIN|[_\d]+)\b)\w+
    



    • \b - Word-boundary;
    • (?!(?:ORIGIN|[_\d]+)\b) - Negative lookahead with a nested non-capturing group; it rejects the match when the word is exactly ORIGIN or consists only of underscores/digits (1+), i.e. when either alternative is followed directly by a word-boundary;
    • \w+ - 1+ word-characters.
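To see how the three parts interact, here is a minimal sketch that tests the pattern against individual tokens. The token list is hypothetical and chosen only to exercise each branch of the lookahead; it is not taken from the question's data.

```python
import re

# Same pattern as above, compiled once for reuse.
pattern = re.compile(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+')

# Hypothetical sample tokens, one for each case the lookahead distinguishes.
tokens = ['ORIGIN', '1', '61', '__', 'malwmrllp1', 'ORIGINS', 'abc']

# A token is kept only when the pattern matches it in full.
kept = [t for t in tokens if pattern.fullmatch(t)]
print(kept)  # ['malwmrllp1', 'ORIGINS', 'abc']
```

ORIGINS survives because the lookahead requires a word-boundary right after ORIGIN, and malwmrllp1 survives because it does not consist of digits/underscores only.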

    import re
      
    text_file = """ORIGIN
        1 malwmrllp1 lallalwgpd paaafvnghl cgshlvealy lvcgergffy tpktrreaed
        61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
    
    //"""
    
    # Keep every word that is not ORIGIN and not digits/underscores only, then join them.
    new_txt = ''.join(re.findall(r'\b(?!(?:ORIGIN|[_\d]+)\b)\w+', text_file))
    print(new_txt, len(new_txt))
    

    Prints:

    malwmrllp1lallalwgpdpaaafvnghlcgshlvealylvcgergffytpktrreaedlqvgqvelgggpgagslqplalegslqkrgiveqcctsicslyqlenycn 110
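If you would rather keep re.sub, as in the question, the character class can be swapped for an alternation. The following is a rough sketch of that alternative, not the pattern used above. The shortened sample string is made up for illustration, and it assumes the input contains no underscores (this variant would leave them in place).

```python
import re

# Shortened, hypothetical sample in the same layout as the question's file.
sample = """ORIGIN
    1 malwmrllp1 lallalwgpd
    61 lqvgqvelgg

//"""

# The question's pattern is one character class: it deletes any single
# character that is a digit, whitespace, non-word character, or one of the
# letters O, R, I, G, N, so the 1 inside malwmrllp1 is lost as well.
broken = re.sub(r'[\b\d+\b\s*$+\sORIGIN$\W+]', '', sample)

# Alternation instead of a class: remove the whole word ORIGIN, a standalone
# run of digits, or any run of non-word characters.
fixed = re.sub(r'\b(?:ORIGIN|\d+)\b|\W+', '', sample)

print(broken)  # malwmrllplallalwgpdlqvgqvelgg
print(fixed)   # malwmrllp1lallalwgpdlqvgqvelgg
```

The word-boundaries around the group are what protect the trailing 1: there is no boundary between p and 1, so \d+ cannot match there.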