
Python - How to remove spaces between Chinese characters while keeping the spaces between a character and a number?


The real issue may be more complicated, but for now I'm trying to accomplish something a bit easier. I want to remove the space between two Chinese/Japanese characters while keeping the space between a number and a character. An example:

text = "今天特别 热,但是我买了 3 个西瓜。"

The output I want to get is

text = "今天特别热,但是我买了 3 个西瓜。"

I tried a Python script with a regular expression:

import re
text = re.sub(r'\s(?=[^A-z0-9])', '', text)

However, the result is

text = '今天特别热,但是我买了 3个西瓜。'

So I'm struggling with how to keep the space between a character and a number in every case. I also don't want to work around it by adding a space back between "3" and "个" afterwards.

I'll keep thinking about it, but let me know if you have any ideas. Thank you so much in advance!


Solution

  • As I understand it, the spaces you need to remove are the ones that sit between two letters.

    Use

    re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text)
    

    Details:

    • (?<=[^\W\d_]) - a positive lookbehind requiring a Unicode letter immediately to the left of the current location
    • \s+ - one or more whitespace characters (remove + if only one is expected)
    • (?=[^\W\d_]) - a positive lookahead that requires a Unicode letter immediately to the right of the current location.
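
To see which characters the class [^\W\d_] accepts, a quick check along these lines may help. It is only a sketch: the helper name is_letter and the sample characters are illustrative assumptions, not part of the original answer.

```python
import re

# [^\W\d_] reads as: a word character (\w) that is neither a digit nor "_".
LETTER = re.compile(r'[^\W\d_]')

def is_letter(ch):
    # Hypothetical helper: True if the single character ch matches the class.
    return LETTER.fullmatch(ch) is not None

# Print how each sample character is classified.
for ch in ['热', 'a', '3', '_', ' ', '。']:
    print(repr(ch), is_letter(ch))
```

The ideograph and the Latin letter should be reported as letters, while the digit, underscore, space and the punctuation mark 。 should not.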

    You do not need the re.U flag, since Unicode matching is on by default for str patterns in Python 3. You do need it in Python 2, though.

    You may also use capturing groups:

    re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', text)
    

    where the non-consuming lookarounds are turned into consuming capturing groups ((...)). The \1 and \2 in the replacement pattern are backreferences to the capturing group values.
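
One difference between the two forms may be worth noting. Because the capturing groups consume the letters on both sides, a letter cannot be the right edge of one match and the left edge of the next. The sketch below illustrates this with a made-up sample string (not from the question) in which single characters are separated by spaces.

```python
import re

sample = "今 天 热"  # illustrative input, not from the question

# Lookarounds only inspect the neighbours, so every space is matched.
lookaround = re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', sample)

# Capturing groups consume "今 天"; the search resumes after "天",
# so the second space no longer has a letter available on its left.
groups = re.sub(r'([^\W\d_])\s+([^\W\d_])', r'\1\2', sample)

print(lookaround)  # 今天热
print(groups)      # 今天 热
```

On the question's input both forms should give the same result, since no letter there sits between two removable spaces.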

    A Python 3 demo:

    import re
    text = "今天特别 热,但是我买了 3 个西瓜。"
    print(re.sub(r'(?<=[^\W\d_])\s+(?=[^\W\d_])', '', text))
    # => 今天特别热,但是我买了 3 个西瓜。
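
Note that [^\W\d_] matches letters of any script, so the pattern above would also join Latin words such as "hello world". If only the spaces between Chinese/Japanese characters should go, one possible variation is to spell out the character ranges explicitly. The ranges below (CJK Unified Ideographs, Hiragana, Katakana) are an assumption about which characters matter and do not cover the extension blocks, and remove_cjk_spaces is a hypothetical helper name; treat this as a sketch rather than a complete solution.

```python
import re

# Assumed ranges: CJK Unified Ideographs (U+4E00-U+9FFF),
# Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF).
CJK = r'[\u4e00-\u9fff\u3040-\u30ff]'

def remove_cjk_spaces(text):
    # Hypothetical helper: drop whitespace only when both neighbours are CJK.
    return re.sub(rf'(?<={CJK})\s+(?={CJK})', '', text)

print(remove_cjk_spaces("今天特别 热,但是我买了 3 个西瓜。"))
# => 今天特别热,但是我买了 3 个西瓜。
print(remove_cjk_spaces("hello world 今天 特别"))
# => hello world 今天特别
```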