
Python re expression returning whole string, but groups not providing whole string


I'm trying to create a regular expression in Python that will match certain elements of a user-inputted string. So far, I have re.match("( 0b[10]+| [0-9]+| '.+?'| \".+?\")+", user_cmd).

When user_cmd = ' 12 0b110110 \' \' " " "str" \'str\'', re.match("( 0b[10]+| [0-9]+| '.+?'| \".+?\")+", user_cmd) returns <re.Match object; span=(0, 32), match=' 12 0b110110 \' \' " " "str" \'str\''>, which is the whole string. Since everything is matched, and everything in the regex is inside parentheses, everything should be in a group, right? It turns out not: re.match("( 0b[10]+| [0-9]+| '.+?'| \".+?\")+", user_cmd).groups() returns (" 'str'",), which is only one item. Why is this? How do I make the regular expression return every item it matched through groups()?


Solution

  • Your pattern repeats a capture group, and a repeated capture group keeps only the value of its last iteration. That is why group 1 holds just " 'str'" (including the leading space).

    For your matches, you don't need to repeat a group if you want the separate matches, and you don't need a capture group if you want the matches only.
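
    As a small check of that behaviour, the sketch below runs the pattern from the question and compares the whole match with group 1. The variable names whole and last are only illustrative.

```python
import re

user_cmd = ' 12 0b110110 \' \' " " "str" \'str\''
pattern = "( 0b[10]+| [0-9]+| '.+?'| \".+?\")+"

m = re.match(pattern, user_cmd)

# group(0) is the full match, built from every iteration of the repeated group.
whole = m.group(0)
# group(1) is overwritten on each iteration, so only the last value survives.
last = m.group(1)

print(repr(whole))
print(repr(last))
```

    The full match covers the entire input, while group 1 only remembers the final token.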

    As all the parts start with a space, you could match a space followed by a non-capturing group that holds the alternation |.

    Instead of a non greedy quantifier .+? you can use a negated character class to have less backtracking.

     (?:0b[10]+|[0-9]+|'[^']+'|"[^"]+")
    
    • (?: Match a space and start a non capture group for the alternation |
      • 0b[10]+ Match 0b and 1+ occurrences of 1 or 0
      • | or
      • [0-9]+ Match 1+ digits 0-9
      • | Or
      • '[^']+' Match from ' till ' using a negated character class which will match 1+ times any char except '
      • | Or
      • "[^"]+" Match from " till " using another negated character class
    • ) Close non capture group
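
    Besides reducing backtracking, the negated character class also stops a quoted part from running across a closing quote. A minimal sketch of the difference, using a made-up input (text) and re.fullmatch to force the whole string to be one quoted part:

```python
import re

# Hypothetical input: two separate quoted parts.
text = "'a' 'b'"

# The lazy .+? may expand across the inner quotes to satisfy the full match.
lazy = re.fullmatch(r"'.+?'", text)

# [^']+ can never consume a quote, so the two parts cannot be merged.
negated = re.fullmatch(r"'[^']+'", text)

print(lazy)
print(negated)
```

    Here the lazy version accepts the whole string as a single quoted part, while the negated class rejects it.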


    For example, using re.findall to get all the matches:

    import re
     
    user_cmd = ' 12 0b110110 \' \' " " "str" \'str\''
    pattern = r" (?:0b[10]+|[0-9]+|'[^']+'|\"[^\"]+\")"
     
    print(re.findall(pattern, user_cmd))
    

    Output

    [' 12', ' 0b110110', " ' '", ' " "', ' "str"', " 'str'"]
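
    If the one-line pattern is hard to read, the same pattern can be laid out with re.VERBOSE so that each alternative carries its own comment. This is only a sketch of an alternative spelling: the leading space is written as [ ] because verbose mode ignores bare whitespace outside a character class, and the names token_re and tokens are illustrative.

```python
import re

token_re = re.compile(r"""
    [ ]             # literal space in front of every part
    (?:             # non-capturing group for the alternation
        0b[10]+     # binary literal: 0b followed by 1+ binary digits
      | [0-9]+      # 1+ decimal digits
      | '[^']+'     # single-quoted part
      | "[^"]+"     # double-quoted part
    )
""", re.VERBOSE)

user_cmd = ' 12 0b110110 \' \' " " "str" \'str\''
tokens = token_re.findall(user_cmd)
print(tokens)
```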
    

    If you want to match the whole string with the repeated group and still get the value of every iteration, you can use captures() from the PyPI regex module:

    import regex
    
    pattern = r"""( (?:0b[10]+|[0-9]+|'[^']+'|\"[^\"]+\"))+"""
    user_cmd = ' 12 0b110110 \' \' " " "str" \'str\''
    m = regex.match(pattern, user_cmd)
    print(m.captures(1))
    

    Output

    [' 12', ' 0b110110', " ' '", ' " "', ' "str"', " 'str'"]
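
    If installing a third-party module is not an option, a similar result can be approximated with the standard library in two steps: first check the whole string with re.fullmatch, then collect the parts with re.findall. This is a sketch under that assumption, not an equivalent of captures(); the helper name tokenize is hypothetical.

```python
import re

# One part: a space followed by a binary literal, digits, or a quoted string.
token = r" (?:0b[10]+|[0-9]+|'[^']+'|\"[^\"]+\")"

def tokenize(cmd):
    """Return every part if cmd consists only of valid parts, else None."""
    # Step 1: the whole string must be one or more parts and nothing else.
    if re.fullmatch(f"(?:{token})+", cmd) is None:
        return None
    # Step 2: extract the individual parts.
    return re.findall(token, cmd)

user_cmd = ' 12 0b110110 \' \' " " "str" \'str\''
tokens = tokenize(user_cmd)
print(tokens)
print(tokenize(" 12 abc"))
```

    The second call returns None because " abc" is not a valid part, so the string as a whole is rejected.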
    
