Tags: python, regex, nlp

How to keep specific words when preprocessing words for NLP? (str.replace & regex)


I want to remove all digits from a string, except where they are part of the word '3d'. I've tried a few approaches without success. Here is my simple code:


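import re

s = 'd3 4 3d'
rep_ls = re.findall('([0-9]+[a-zA-Z]*)', s)
# rep_ls == ['3', '4', '3d']

for n in rep_ls:
    if n == '3d':
        continue
    # str.replace removes every occurrence of n, so replacing '3'
    # also strips the '3' inside '3d'
    s = s.replace(n, '')

# actual:   s == 'd  d'
# expected: 'd 3d'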

Solution

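  • This expression,

    (?i)(3d)\b|(\D+)|\d+
    

    might work with re.sub and the replacement \1\2. Group 1 keeps 3d, group 2 keeps runs of non-digit characters, and the ungrouped \d+ matches the remaining digits, which are dropped because they land in neither group.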

    The (?i) flag makes the match case-insensitive, so this version also keeps 3D. If 3D should be removed instead, (?i) can simply be dropped:

    (3d)\b|(\D+)|\d+
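
To illustrate the difference the flag makes, here is a minimal sketch that runs the same pattern with and without case-insensitive matching. The helper name `strip_digits`, the `KEEP` constant, and the sample string are made up for this illustration. It relies on Python 3.5+, where an unmatched group in the replacement is substituted as an empty string.

```python
import re

# Replacement keeps group 1 (the protected token) and group 2 (non-digit text).
KEEP = r'\1\2'

def strip_digits(text, ignore_case):
    """Remove digit runs from text, keeping '3d' (and '3D' if ignore_case)."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(r'(3d)\b|(\D+)|\d+', KEEP, text, flags=flags)

sample = '3D 3d 7'  # hypothetical input
with_flag = strip_digits(sample, ignore_case=True)
without_flag = strip_digits(sample, ignore_case=False)

print(repr(with_flag))     # '3D 3d '
print(repr(without_flag))  # 'D 3d '
```

Passing `re.IGNORECASE` as a flag is equivalent to prefixing the pattern with `(?i)`.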
    

    Any other tokens you want to keep go in the first capturing group as additional alternatives:

    (3d|4d|anything_else)\b|(\D+)|\d+
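
If the list of tokens to keep lives in Python rather than in a hand-written pattern, the first group could be assembled from it. This is only a sketch: `build_keep_pattern` and the sample input are hypothetical names chosen for the example, and `re.escape` is used on the assumption that the tokens are literal strings rather than sub-patterns.

```python
import re

def build_keep_pattern(keep_words):
    """Compile a pattern whose group 1 matches any literal word in keep_words."""
    # re.escape guards against regex metacharacters inside the words.
    alternatives = '|'.join(re.escape(word) for word in keep_words)
    return re.compile(r'(' + alternatives + r')\b|(\D+)|\d+', re.IGNORECASE)

pattern = build_keep_pattern(['3d', '4d'])
result = pattern.sub(r'\1\2', 'd3 4d 5 3d')

print(repr(result))  # 'd 4d  3d'
```

The standalone 5 is removed but the spaces around it remain, which is why the result contains a double space.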
    

    Test

    import re
    
    regex = r'(?i)(3d)\b|(\D+)|\d+'
    string = '''d3 4 3d'''
    
    print(re.sub(regex, r'\1\2', string))
    

    Output

    d  3d
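
Removing a standalone number leaves the spaces on both sides of it in place, so the raw result of the substitution contains a double space. If single-spaced output such as the 'd 3d' expected in the question is wanted, one option (an addition to the approach above, not part of the original pattern) is to collapse whitespace afterwards:

```python
import re

text = 'd3 4 3d'
stripped = re.sub(r'(?i)(3d)\b|(\D+)|\d+', r'\1\2', text)

# str.split() with no arguments splits on any run of whitespace and drops
# leading/trailing whitespace, so joining with ' ' normalizes the spacing.
normalized = ' '.join(stripped.split())

print(repr(stripped))    # 'd  3d'
print(repr(normalized))  # 'd 3d'
```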
    

    RegEx Circuit

    [Figure: jex.im visualization of the regular expression]