python-3.x regex list nlp

Need help to remove punctuation and replace numbers for an nlp task


For example, I have a list of strings:

sentence = ['cracked $300 million','she\'s resolutely, smitten ', 'that\'s creative [r]', 'the market ( knowledge check : prices up!']

I want to remove the punctuation and replace numbers with the '£' symbol. I have tried the code below, but when I run both substitutions together, only one of them takes effect:

import re
s =([re.sub(r'[!":$()[]\',]',' ', word) for word in sentence]) 

s= [([re.sub(r'\d+','£', word) for word in s])]
print(s)

I think the problem could be in the square brackets? Thank you!


Solution

  • To replace specific punctuation symbols with a space and each run of digits with a £ sign, you can use:

    import re
    rx = re.compile(r'''[][!":$()',]|(\d+)''')
    sentence = ['cracked $300 million','she\'s resolutely, smitten ', 'that\'s creative [r]', 'the market ( knowledge check : prices up!']
    s = [rx.sub(lambda x: '£' if x.group(1) else ' ', word) for word in sentence] 
    print(s) # => ['cracked  £ million', 'she s resolutely  smitten ', 'that s creative  r ', 'the market   knowledge check   prices up ']
    

    Note the placement of [ and ] inside the character class: when ] is the first character of the class, it does not need to be escaped, and [ never needs to be escaped inside a character class. I also used a triple-quoted string literal, so " and ' can be used as is without extra escaping.

    So, here, [][!":$()',]|(\d+) either matches one of ], [, !, ", :, $, (, ), ' or a comma, or matches one or more digits and captures them into Group 1. If Group 1 matched, the replacement is the pound sign (£); otherwise, it is a space.
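
To make the bracket issue concrete, here is a minimal sketch of why the pattern from the question matched nothing, and of how the original two-step approach gives the same result as the single regex once the character class is fixed. The names broken, punct, digits, single, clean_two_pass and clean_single_pass are illustrative and not part of the original post.

```python
import re

sentence = [
    'cracked $300 million',
    "she's resolutely, smitten ",
    "that's creative [r]",
    'the market ( knowledge check : prices up!',
]

# Pattern from the question. The first ']' closes the character class early,
# so the regex means: one of  ! " : $ ( ) [  followed by the literal text ',]
broken = re.compile(r'[!":$()[]\',]')
print(broken.search("that's creative [r]"))  # no match in the sample data
print(broken.search("[',]"))                 # matches only this odd sequence

# Fixed class: ']' comes first, so it is read as a literal character.
punct = re.compile(r'''[][!":$()',]''')
digits = re.compile(r'\d+')

def clean_two_pass(words):
    """Replace punctuation with spaces, then each digit run with '£'."""
    no_punct = [punct.sub(' ', w) for w in words]
    return [digits.sub('£', w) for w in no_punct]

# Single-pass variant, as in the answer above.
single = re.compile(r'''[][!":$()',]|(\d+)''')

def clean_single_pass(words):
    """Do both replacements in one scan using a callable replacement."""
    return [single.sub(lambda m: '£' if m.group(1) else ' ', w) for w in words]

print(clean_two_pass(sentence) == clean_single_pass(sentence))
```

Both functions return a flat list of strings. The code in the question also wrapped the second list comprehension in an extra pair of square brackets, which nests the result in an outer list. The single-pass version scans each string once, which is the main reason to prefer it when there are many strings.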