Tags: python, regex, tokenize

Capturing all operators, parentheses and numbers separately in "(3 + 44)* 5 / 7" with Regex


For input string: st = "(3 + 44)* 5 / 7"

I'm looking to get the following result using only regex: ["(", "3", "+", "44", ")", "*", "5", "/", "7"]

Attempts:

  1. >>> re.findall("[()\d+\-*/].?", st)
    ['(3', '+ ', '44', ')*', '5 ', '/ ', '7']
    

    But I need to capture the parentheses in '(3' and ')*' separately as well.

  2. >>> re.findall("[()\d+\-*/]?", st)    
    ['(', '3', '', '+', '', '4', '4', ')', '*', '', '5', '', '/', '', '7', '']
    

    This gives tons of blank tokens.


Solution

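  • You can't use multi-character constructs like \d+ in a character class. Inside [...], the + is a literal plus sign rather than a quantifier, so [\d+] matches exactly one character: a digit or a "+".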

    So you can do it by brute force like this:

    re.findall(r"\(|\)|\d+|\+|-|\*|/", st)
    

    Or you can use a character class for single-character tokens, alternated with other things:

    re.findall(r"[()+\-*/]|\d+", st)
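
As a quick check, here is a minimal runnable sketch of the character-class approach applied to the string from the question. The class includes + so that the plus operator in the expected result is captured. The names TOKEN and tokenize are only illustrative and are not part of the original answer.

```python
import re

st = "(3 + 44)* 5 / 7"

# Inside a character class, + is a literal plus sign, not a quantifier,
# so [\d+] matches one character at a time: a digit or "+".
single_chars = re.findall(r"[\d+]", "44+5")

# Single-character tokens go in the class; multi-digit numbers are a
# separate alternative so that \d+ can act as a real quantifier.
# TOKEN and tokenize are hypothetical names used for this sketch.
TOKEN = re.compile(r"[()+\-*/]|\d+")


def tokenize(text):
    """Return operators, parentheses and numbers as separate strings."""
    return TOKEN.findall(text)


tokens = tokenize(st)
print(single_chars)  # ['4', '4', '+', '5']
print(tokens)        # ['(', '3', '+', '44', ')', '*', '5', '/', '7']
```

Whitespace is skipped because findall only returns the matches and no alternative matches a space. Note that any other unexpected character (a letter, for instance) would be dropped silently in the same way.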