python regex regex-lookarounds lookbehind

Python regular expression: how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equals sign after 123 should be removed.

Here is my Python script:

import re
s = "中国，中，。》％国foo中¥国bar@中123=国％中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)

The expected output is:

中国中国foo中国bar中123国中国12-34中国

but the actual result is:

中国中国foo中国bar中123=国中国12-34中国

I can't figure out why there is an extra equals sign in the output.

Solution

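Your pattern only matches a run of punctuation when both lookarounds succeed, that is, when the character before it and the character after it are both non-digits. The "=" in 123= is preceded by the digit 3, so the lookbehind (?<=[^0-9]) fails, nothing is matched, and the "=" is left in place. In other words, your code keeps any non-alphanumeric, non-Chinese character that has a digit on either side, not only those with digits on both sides. The hyphen in 12-34 survives for the same reason.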
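To see this concretely, here is a small sketch that lists what the pattern from the question actually matches in the sample string. The names PUNCT and original are only illustrative and are not part of the question's code.

```python
import re

s = "中国，中，。》％国foo中¥国bar@中123=国％中国12-34中国"

# Anything that is not a Chinese character, a digit, or an ASCII letter.
PUNCT = r'[^\u4e00-\u9fff0-9a-zA-Z]+'

# The pattern from the question: punctuation with a non-digit on BOTH sides.
original = re.compile(r'(?<=[^0-9])' + PUNCT + r'(?=[^0-9])')

# Collect every run that re.sub would have removed.
matched = [m.group() for m in original.finditer(s)]
print(matched)
```

Neither "=" nor "-" shows up in the list, because each of them has a digit immediately before it, so the lookbehind rejects both.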

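Instead, you can match what you want to keep: either a single Chinese or alphanumeric character, or a run of punctuation with a digit on both sides. You can try the following regex: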

u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'

You can use it like this:

import re
s = "中国，中，。》％国foo中¥国bar@中123=国％中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
# findall returns a list of the kept pieces; join them back into one string
print(''.join(res))
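If you would rather stay with re.sub, one possible variant (a sketch, not the only way to do it) is to capture the punctuation that sits between digits and put it back from a replacement function, while every other punctuation run is replaced with an empty string. The names PUNCT and keep_between_digits are made up for this example.

```python
import re

s = "中国，中，。》％国foo中¥国bar@中123=国％中国12-34中国"

# Anything that is not a Chinese character, a digit, or an ASCII letter.
PUNCT = r'[^\u4e00-\u9fff0-9a-zA-Z]+'

# First alternative: a punctuation run with a digit on both sides (captured).
# Second alternative: any other punctuation run (not captured).
pattern = re.compile(r'(?<=[0-9])(' + PUNCT + r')(?=[0-9])|' + PUNCT)

def keep_between_digits(match):
    # group(1) is None when the second alternative matched, so the run is dropped.
    return match.group(1) or ''

res = pattern.sub(keep_between_digits, s)
print(res)
```

For the sample string this gives the same result as the findall version: the "=" after 123 is removed and the hyphen in 12-34 is kept.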