Search code examples
regexpython-3.xmultiline

Python regex match across multiple lines


I am trying to match a regex pattern across multiple lines. The pattern begins and ends with a substring, both of which must be at the beginning of a line. I can match across lines, but I can't seem to specify that the end pattern must also be at the beginning of a line.

Example string:

Example=N      ; Comment Line One error=

; Comment Line Two.

Desired=

I am trying to match from Example= up to Desired=. This will work if error= is not in the string. However, when it is present I match Example=N ; Comment Line One error=

config_value = 'Example'
pattern = '^{}=(.*?)([A-Za-z]=)'.format(config_value)
match = re.search(pattern, string, re.M | re.DOTALL)

I also tried:

config_value = 'Example'
pattern = '^{}=(.*?)(^[A-Za-z]=)'.format(config_value)
match = re.search(pattern, string, re.M | re.DOTALL)

Solution

  • Your pattern contains .*? and your options include re.DOTALL (also, re.S is its equivalent) that makes the . match newlines, too. For those of you who wonder why your regex does not match across multiple lines, check these first:

    • Make sure you read the whole text where an expected match can span across lines into a single variable (remember that regex can only be used to search for strings inside strings, text)
    • Make sure your pattern actually can match any characters and the most common problem is that you use . without re.S/re.DOTALL flags (or their inline equivalent (?s)). As an example, you can't expect a match in re.search(r'a(.*?)b', '__ a\nb __'), but you will find a match in re.search(r'a(.*?)b', '__ a\nb __', re.DOTALL) or re.search(r'(?s)a(.*?)b', '__ a\nb __')
    • Make sure you test the regex properly. If you use online regex testing tools, remember to replace any escape sequences with literal symbols. It is a common issue to just copy/paste the string literal and use it in the example text field, which leads to confusion. So, if you want to test against '__ a\nb __', paste it there, and replace the \n with the real literal line break
    • DO NOT CONFUSE re.S/re.DOTALL with re.M/re.MULTILINE. The latter is used to re-define the behavior of the ^ and $ anchors only. It means the ^ will match the start of any line and the $ will match the end of any line if you use re.M/re.MULTILINE. It does not mean your regex will automatically start finding matches that span across multiple lines, please mind this.

    More references:

    Now, for this concrete case in the OP, you may use

    config_value = 'Example'
    pattern=r'(?sm)^{}=(.*?)(?=[\r\n]+\w+=|\Z)'.format(config_value)
    match = re.search(pattern, s)
    if match:
        print(match.group(1))
    

    See the Python demo.

    Pattern details

    • (?sm) - re.DOTALL and re.M are on
    • ^ - start of a line
    • Example= - a substring
    • (.*?) - Group 1: any 0+ chars, as few as possible
    • (?=[\r\n]+\w+=|\Z) - a positive lookahead that requires the presence of 1+ CR or LF symbols followed with 1 or more word chars followed with a = sign, or end of the string (\Z).

    See the regex demo.