Disclaimer - This is not a homework question. Normally I wouldn't ask something so simple, but I can't find an elegant solution.
What I am trying to achieve:
Input from OCR: "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022"
Parsed output: "01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022"
Essentially, I want to remove the spaces between digits, with one caveat: only digits that stand alone as a single character should be joined (to avoid mangling things like dates). For a date like "1 1 1970", though, it's fine if it gets converted to "11 1970", since that doesn't violate the single-character rule.
The most decent regex I could think of was (.*?\D)\d( \d)+. However, this doesn't work for numbers at the beginning of the string. Search and replace is also fairly complicated with this regex (I can't do a re.subn with this).
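To make the limitation concrete, here is a small check of that pattern against the start of the sample input. The variable names are only illustrative. Because the pattern requires a non-digit before the first digit, the leading "0 1" cannot start a match and ends up swallowed into the lazy prefix group instead:

```python
import re

# The pattern from the question, tried on the start of the sample input.
attempt = re.compile(r"(.*?\D)\d( \d)+")
sample = "0 1 loren ipsum 1 2 3 dolor sit"

match = attempt.search(sample)

# Group 1 is the prefix that had to be consumed before a digit run was found.
print(repr(match.group(1)))  # '0 1 loren ipsum '

# The whole match starts at index 0, so "0 1" is never treated as a digit run.
print(repr(match.group(0)))  # '0 1 loren ipsum 1 2 3'
```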
Can anyone think of an elegant Python-based solution (preferably using regex) to achieve this?
>>> import re
>>> regex = re.compile(r"(?<=\b\d)\s+(?=\d\b)")
>>> regex.sub("", "0 1 loren ipsum 1 2 3 dolor sit 4 5 6 amet -7 8 9 1- date 13 06 2022")
'01 loren ipsum 123 dolor sit 456 amet -7891- date 13 06 2022'
Explanation:
(?<=\b\d) # Assert that a single digit precedes the current position
\s+ # Match one (or more) whitespace character(s)
(?=\d\b) # Assert that a single digit follows the current position
The sub() operation removes all whitespace that matches this rule.
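As a minimal sketch, the same pattern can be wrapped in a reusable function, with the explanation kept inline via re.VERBOSE. The names SINGLE_DIGIT_GAP and join_single_digits are made up for this example. It also checks the two date cases mentioned in the question:

```python
import re

# Hypothetical names; the pattern itself is the one shown above.
SINGLE_DIGIT_GAP = re.compile(
    r"""
    (?<=\b\d)   # a lone digit sits immediately before this position
    \s+         # the whitespace run to delete
    (?=\d\b)    # a lone digit sits immediately after it
    """,
    re.VERBOSE,
)


def join_single_digits(text: str) -> str:
    """Remove whitespace between digits that each stand alone."""
    return SINGLE_DIGIT_GAP.sub("", text)


print(join_single_digits("1 1 1970"))         # 11 1970
print(join_single_digits("date 13 06 2022"))  # date 13 06 2022
```

Since the lookarounds consume no characters, each whitespace run is matched independently, which is why a chain like "7 8 9 1" collapses in a single pass. Note that \s+ also matches tabs and newlines; if only literal spaces should be removed, replacing \s+ with " +" would be a reasonable variation. In verbose mode that space would be ignored unless escaped or written as a character class such as [ ]+.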