Tags: python, regex, python-re

Nice regex for cleaning up space-separated digits in Python


Disclaimer - This is not a homework question. Normally I wouldn't ask something so simple, but I can't find an elegant solution.

What I am trying to achieve -

Input from OCR: "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
Parsed output: "01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022"

Essentially, I want to remove the spaces between digits. The caveat is that only digits that stand alone as single characters should be joined, so that multi-digit groups such as dates are left untouched. For a date like "1 1 1970", though, it's fine if it gets converted to "11 1970", since that doesn't violate the single-character principle.

The best regex I could come up with was (.*?\D)\d( \d)+. However, it doesn't work for numbers at the beginning of the string, because it requires a non-digit before the first digit. Search and replace is also fairly complicated with this regex (I can't do a simple re.subn with it).
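To make the two problems concrete, here is a small sketch that runs the attempted pattern on short sample strings. The sample strings and the name `attempt` are made up for illustration; only the pattern itself comes from the question.

```python
import re

# The attempted pattern from the question, unchanged.
attempt = re.compile(r"(.*?\D)\d( \d)+")

# Problem 1: \D demands a non-digit before the first digit,
# so a digit run at the very start of the string is never matched.
print(attempt.search("0 1 abc"))  # None

# Problem 2: when it does match, the prefix is part of the match and the
# repeated group ( \d)+ only remembers its last repetition, which makes
# a plain substitution awkward.
m = attempt.search("x 1 2 3")
print(repr(m.group(0)))  # 'x 1 2 3'
print(repr(m.group(1)))  # 'x '
print(repr(m.group(2)))  # ' 3'
```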

Can anyone think of an elegant Python-based solution (preferably using regex) to achieve this?


Solution

    >>> import re
    >>> regex = re.compile(r"(?<=\b\d)\s+(?=\d\b)")
    >>> regex.sub("", "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022")
    '01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022'
    

    Explanation:

    (?<=\b\d) # Assert that a standalone single digit (word boundary, then one digit) precedes the current position
    \s+       # Match one (or more) whitespace character(s)
    (?=\d\b)  # Assert that a standalone single digit (one digit, then word boundary) follows the current position
    

    The sub() operation removes all whitespace that matches this rule.
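As a minimal sketch, the same pattern can be wrapped in a helper and checked against the edge cases mentioned in the question. The names `SINGLE_DIGIT_GAP` and `join_single_digits` are hypothetical and chosen here only for illustration.

```python
import re

# Whitespace that sits between two standalone single digits
# (the same pattern as in the answer above).
SINGLE_DIGIT_GAP = re.compile(r"(?<=\b\d)\s+(?=\d\b)")

def join_single_digits(text):
    """Remove whitespace separating standalone single digits."""
    return SINGLE_DIGIT_GAP.sub("", text)

print(join_single_digits("1 1 1970"))    # 11 1970
print(join_single_digits("13 06 2022"))  # 13 06 2022
print(join_single_digits("-7 8 9 1-"))   # -7891-
```

Two things may matter for OCR input. First, \s+ also matches tabs and newlines, so single digits on adjacent lines would be joined. Second, in Python 3 \d matches any Unicode decimal digit in str patterns. If only 0-9 should count, compiling with the re.ASCII flag is one option.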