
Regex: split new lines between constant words


Given

Word1   content1 content1 content1
       content2 content2 content2
         
          content3 content3 content3
Word2

I want to extract content1, content2 and content3 as separate groups. Could you help me write a regex for that? I tried:

Word1[\s:]*((?P<value>[^\n]+)\n)+Word2 with the gms flags, but it didn't help. I need a regex for the Python re module.


Solution

  • You can use

    import re
    text = "Word1   content1 content1 content1\n       content2 content2 content2\n          content3 content3 content3\nWord2"
    match = re.search(r'Word1[\s:]*((?:.+\n)*)Word2', text)
    if match:
        print([s.strip() for s in match.group(1).splitlines()])
    

    Output:

    ['content1 content1 content1', 'content2 content2 content2', 'content3 content3 content3']
    

    Details:

    • Word1 - the literal string Word1
    • [\s:]* - zero or more whitespace characters or colons
    • ((?:.+\n)*) - Group 1: zero or more repetitions of one or more characters other than line break characters (as many as possible), followed by a newline character
    • Word2 - the literal string Word2.

    Then, if there is a match, [s.strip() for s in match.group(1).splitlines()] splits the Group 1 value into separate lines and strips the surrounding whitespace from each one.
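It may help to see why the pattern from the question is not enough on its own. In Python's re module, a capturing group that sits inside a repetition keeps only the text from its last iteration, so the overall match succeeds but the named group value ends up holding just the final line. A minimal sketch, reusing the sample text from above (the variable names attempt and last_only are only for illustration):

```python
import re

text = (
    "Word1   content1 content1 content1\n"
    "       content2 content2 content2\n"
    "          content3 content3 content3\n"
    "Word2"
)

# The pattern from the question: the named group is inside a repeated group.
attempt = re.search(r"Word1[\s:]*((?P<value>[^\n]+)\n)+Word2", text)

# The whole match succeeds, but "value" holds only the last repetition;
# the earlier lines were overwritten on each pass through the loop.
last_only = attempt.group("value").strip()
print(last_only)  # content3 content3 content3
```

This is why the solution captures the whole block in a single group and splits it afterwards, rather than trying to get one group per line.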

    An alternative solution uses the PyPI regex library, which supports variable-width lookbehinds (the built-in re module does not):

    import regex
    text = "Word1   content1 content1 content1\n       content2 content2 content2\n          content3 content3 content3\nWord2"
    print(regex.findall(r'(?<=Word1[\s:]*(?s:.*?))\S(?:.*\S)?(?=(?s:.*?)\nWord2)', text))
    

    Details:

    • (?<=Word1[\s:]*(?s:.*?)) - a positive lookbehind that requires Word1, then zero or more whitespace characters or colons, and then any zero or more characters (as few as possible) immediately to the left of the current location
    • \S(?:.*\S)? - a non-whitespace character, optionally followed by any zero or more characters other than line break characters (as many as possible) up to the last non-whitespace character on the line
    • (?=(?s:.*?)\nWord2) - a positive lookahead that requires any zero or more characters (as few as possible), then a newline character and Word2, to the right of the current location.
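If installing a third-party package is not an option, the \S(?:.*\S)? idea can be approximated with the standard re module in two steps: capture the block between the two words first, then pull the trimmed lines out of it. This is a sketch rather than a drop-in equivalent of the lookaround pattern: the function name extract_between and its start and end parameters are hypothetical, and it assumes the closing word starts at the beginning of a line.

```python
import re

def extract_between(text, start="Word1", end="Word2"):
    """Return the trimmed, non-blank lines found between start and end."""
    # re.S lets .*? cross line breaks; re.M lets ^ match at each line start.
    pattern = re.escape(start) + r"[\s:]*(.*?)^" + re.escape(end)
    match = re.search(pattern, text, re.S | re.M)
    if not match:
        return []
    # No flags here, so . stops at line breaks: one result per non-blank line.
    return re.findall(r"\S(?:.*\S)?", match.group(1))

# Sample from the question, including the whitespace-only line.
sample = (
    "Word1   content1 content1 content1\n"
    "       content2 content2 content2\n"
    "         \n"
    "          content3 content3 content3\n"
    "Word2"
)
print(extract_between(sample))
```

Because the second step only matches runs that start and end with a non-whitespace character, blank and whitespace-only lines inside the block are skipped instead of showing up as empty strings.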