
Python regular expression: how to remove all punctuation characters from a string but keep those between numbers?


I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equals sign after 123 should be removed.

Here is my Python script:

import re
s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)

The expected output is:

中国中国foo中国bar中123国中国12-34中国

But the actual result is:

中国中国foo中国bar中123=国中国12-34中国

I can't figure out why there is an extra equals sign in the output.


Solution

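  • Your regex removes a run of punctuation only when both lookarounds succeed, that is, when the run is preceded by a non-digit and followed by a non-digit. The "=" in 123= is preceded by the digit 3, so the lookbehind (?<=[^0-9]) fails and the "=" is kept. In other words, your code keeps any non-alphanumeric, non-Chinese character that has a digit on either side, not only those with digits on both sides. It also never removes punctuation at the very start or end of the string, because [^0-9] inside a lookaround has to match an actual character.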

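    Instead of matching what to remove, you can match what to keep: every Chinese or alphanumeric character, plus any punctuation run that has a digit on both sides. You can try the following regex: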

    u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
    

    You can use it like this, joining the matches that re.findall returns:

    import re
    s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
    res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
    print(''.join(res))
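
If you would rather stay with re.sub, as in the question, the following is a minimal sketch of one way to do it. It is not part of the original answer, and the names KEEP, PATTERN, and strip_punctuation are made up for illustration. The first alternative captures a punctuation run that has a digit on both sides; the second matches any other punctuation run. The replacement function puts back the captured run and drops everything else.

```python
import re

# Characters kept unconditionally: CJK ideographs and ASCII alphanumerics.
KEEP = '\u4e00-\u9fff0-9a-zA-Z'

# Alternative 1: a punctuation run with a digit on both sides (captured).
# Alternative 2: any other punctuation run (not captured).
PATTERN = re.compile(
    '(?<=[0-9])([^' + KEEP + ']+)(?=[0-9])|[^' + KEEP + ']+'
)

def strip_punctuation(text):
    # group(1) is None when alternative 2 matched, so the run is dropped;
    # otherwise the captured run between digits is put back unchanged.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
print(strip_punctuation(s))  # 中国中国foo中国bar中123国中国12-34中国
```

Unlike the regex in the question, this version also removes punctuation at the start or end of the string, because neither alternative requires a non-digit character to be present next to the run.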