python regex regex-greedy regex-lookarounds lookbehind

Python (Perl-type) regex lookahead/lookbehind


Consider a string s = "aa,bb11,22 , 33 , 44,cc , dd ".

I would like to split s into the following list of tokens using Python's re module, whose regular expressions are similar to Perl's:

  1. "aa,bb11"
  2. "22"
  3. "33"
  4. "44,cc , dd "

Note:

  • I want to tokenise on commas, but only on those commas that have a number on both sides.
  • Any (optional) whitespace around these "numerical commas" that I'm targeting should be removed in the result. The optional whitespace may be more than a single space.
  • Any other whitespace should be left as it appears in the original string.

My best attempt so far is the following:

import re

pattern = r'(?<=\d)(\s*),(\s*)(?=\d)'
s = 'aa,bb11,22 , 33 , 44,cc , dd '

print(re.compile(pattern).split(s))

but this prints:

['aa,bb11', '', '', '22', ' ', ' ', '33', ' ', ' ', '44,cc , dd ']

which is close to what I want, inasmuch as the four tokens I want are contained in the list. I could go through and remove the empty strings and the strings that consist only of spaces, but I'd rather have a single regex that does all of this for me.

Any ideas?


Solution

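  • Don't put capture groups on the \s*. When the pattern contains capture groups, split also returns the text matched by each group, which is where the extra '' and ' ' entries come from: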

    pattern = r'(?<=\d)\s*,\s*(?=\d)'
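
A minimal sketch comparing the two patterns on the string from the question, using re.split directly. The variable names with_groups, tokens and tokens_noncapturing are only illustrative, and the (?:...) variant is an extra shown for the case where grouping is still wanted; it is not part of the original answer.

```python
import re

s = 'aa,bb11,22 , 33 , 44,cc , dd '

# Capturing version from the question: the text matched by each (\s*)
# group is inserted into the result list alongside the tokens.
with_groups = re.split(r'(?<=\d)(\s*),(\s*)(?=\d)', s)

# Group-free version from the answer: only the text between the
# separators is returned, and the whitespace around them is consumed.
tokens = re.split(r'(?<=\d)\s*,\s*(?=\d)', s)

# If grouping is needed for some other reason, non-capturing groups
# (?:...) also keep the separator text out of the result.
tokens_noncapturing = re.split(r'(?<=\d)(?:\s*),(?:\s*)(?=\d)', s)

print(with_groups)  # ['aa,bb11', '', '', '22', ' ', ' ', '33', ' ', ' ', '44,cc , dd ']
print(tokens)       # ['aa,bb11', '22', '33', '44,cc , dd ']
```

The commas in "aa,bb11" and "44,cc , dd " are left alone because the lookbehind or the lookahead fails to find a digit next to them, so the surrounding whitespace there stays as it was.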