python regex

Strange behavior of capturing group in regular expression


Given the following simple regular expression, whose goal is to capture the text between quote characters:

regexp = '"?(.+)"?'

When the input is something like:

"text"

Capturing group 1, group(1), contains the following:

text"

I expected group(1) to contain text only (without the quotes). Could somebody explain what's going on and why the regular expression is capturing the " symbol even when it's outside the capturing group #1? Another strange behavior that I don't understand is why the second quote character is captured but not the first one given that both of them are optional. I eventually fixed it by using the following regex, but I would like to understand what I was doing wrong:

regexp = '"?([^"]+)"?'

Solution


    regexp = '^"?(.*?)"?$'
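As a quick sanity check, the anchored pattern can be tried with Python's built-in re module. This is only a sketch: the helper name strip_quotes and the sample inputs are illustrative assumptions, not part of the original answer.

```python
import re

# Anchored pattern from the answer: optional quote, lazy body, optional quote.
ANCHORED = re.compile(r'^"?(.*?)"?$')


def strip_quotes(value):
    """Return the text between optional surrounding double quotes.

    Hypothetical helper, named here only for illustration.
    """
    match = ANCHORED.match(value)
    # The dot does not match line breaks by default, so input that
    # spans several lines produces no match and None is returned.
    return match.group(1) if match else None


for sample in ['"text"', 'text', '"text', 'text"']:
    print(repr(sample), '->', repr(strip_quotes(sample)))
```

Each of the four samples should come back as text, because the anchors force the lazy group to stretch only as far as needed for the optional closing quote and the end of the string to match.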
    

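    Or, if the regex engine allows lookarounds, including variable-width lookbehind (^"? can be zero or one character wide). Note that the lookbehind is also satisfied at the very start of the string, so a leading quote can still end up in the match; the capturing-group version above is the more reliable of the two.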

    regexp = '(?<=^"?).*?(?="?$)'
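One caveat for Python specifically: as far as the standard re module is concerned, a lookbehind must have a fixed width, and ^"? does not, so this pattern is rejected at compile time there. The sketch below only demonstrates that rejection; the variable names are illustrative assumptions.

```python
import re

# Lookaround variant from the answer. The lookbehind ^"? can be
# zero or one character wide, i.e. it is not fixed-width.
LOOKAROUND = r'(?<=^"?).*?(?="?$)'

try:
    re.compile(LOOKAROUND)
    compiled_ok = True
except re.error as exc:
    # The standard re module refuses variable-width lookbehind.
    compiled_ok = False
    print('re rejected the pattern:', exc)

print('compiled:', compiled_ok)
```

With the standard library alone, the capturing-group form is therefore the one to use.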
    

    Details

    • ^ - start of string
    • "? - an optional " char
    • (.*?) - Group 1: zero or more characters other than line break characters, as few as possible
    • "? - an optional " char
    • $ - end of string.

    Explanation

    why the regular expression is capturing the " symbol even when it's outside the capturing group #1

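    The "?(.+)"? pattern contains a greedy dot-matching subpattern, .+, and a . can match a " too. The trailing "? is optional, so it is allowed to match nothing. When a greedy subpattern (here, .+) can itself match what the following optional subpattern would match (here, a "), the greedy one takes it: .+ runs to the end of the string, closing quote included, the trailing "? then matches the empty string, and the engine never has to backtrack to give the quote back.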
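The greedy behavior described here can be observed directly in Python. This is a minimal sketch using the pattern and input from the question; the variable names, and the lazy unanchored variant added for contrast, are assumptions made for illustration.

```python
import re

# Original pattern from the question: greedy .+ inside the group.
greedy = re.compile(r'"?(.+)"?')
match = greedy.search('"text"')

whole = match.group(0)      # everything the pattern matched
captured = match.group(1)   # what the greedy .+ consumed
span = match.span(1)        # where group 1 starts and ends

print(repr(whole), repr(captured), span)

# For contrast: making the dot lazy without anchoring the pattern
# over-corrects, because nothing forces the group past one character.
lazy_unanchored = re.search(r'"?(.+?)"?', '"text"').group(1)
print(repr(lazy_unanchored))
```

Group 1 ends at the end of the string, so the closing quote is inside it, and the lazy variant stops after a single character. That is why a lazy group needs the ^ and $ anchors to be useful here.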

    A negated character class is the correct way to match any character except a specific one (or a specific set or range of characters). [^"] will never match a ", so [^"]+ stops before the closing quote, which is then left for the trailing "? to match and stays out of the group.
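A short Python check of the negated character class, using the pattern from the question's own fix. The sample inputs and variable names are illustrative assumptions.

```python
import re

# Pattern from the question's fix: the group can never contain a quote.
negated = re.compile(r'"?([^"]+)"?')

results = {}
for sample in ['"text"', 'text', '"text']:
    m = negated.search(sample)
    results[sample] = m.group(1)
    # group(0) is the full match, so it shows which quotes were consumed
    # outside the group.
    print(repr(sample), '-> group(0):', repr(m.group(0)),
          'group(1):', repr(m.group(1)))
```

Here the class itself enforces the boundary, so no anchors or lazy quantifiers are needed for single quoted values like these.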

    why the second quote character is captured but not the first one given that both of them are optional

    The first "? comes before the greedy dot-matching pattern. The engine works through the pattern from left to right, so it sees the " (if there is one at that position) and matches it with the first "? before .+ gets a chance to consume it.
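To see which part of the pattern consumes which character, each piece can be wrapped in its own named group. This is a hypothetical instrumented variant of the question's pattern; the group names open, body and close and the helper trace are assumptions introduced only for this sketch.

```python
import re

# Same structure as "?(.+)"? but with every piece captured separately.
traced = re.compile(r'(?P<open>"?)(?P<body>.+)(?P<close>"?)')


def trace(sample):
    """Return a dict showing what each part of the pattern matched."""
    return traced.search(sample).groupdict()


for sample in ['"text"', 'text"']:
    print(repr(sample), trace(sample))
```

In both cases close comes back empty, since the greedy body has already taken the final quote, while open holds the leading quote whenever the input starts with one.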