Given
Word1 content1 content1 content1
content2 content2 content2
content3 content3 content3
Word2
I want to extract content1, content2 and content3 as groups. Could you help me write a regex for that? I tried:
Word1[\s:]*((?P<value>[^\n]+)\n)+Word2
with the gms flags, but it didn't help. I need a regex for the Python re module.
You can use
import re
text = "Word1 content1 content1 content1\n content2 content2 content2\n content3 content3 content3\nWord2"
match = re.search(r'Word1[\s:]*((?:.+\n)*)Word2', text)
if match:
    print([s.strip() for s in match.group(1).splitlines()])
See the Python and the regex demo.
Output:
['content1 content1 content1', 'content2 content2 content2', 'content3 content3 content3']
Details:
Word1 - a Word1 string
[\s:]* - zero or more whitespaces and colons
((?:.+\n)*) - Group 1: zero or more repetitions of one or more chars other than line break chars, as many as possible, followed with a newline char
Word2 - a Word2 string.
Then, if there is a match, [s.strip() for s in match.group(1).splitlines()] splits the Group 1 value into separate lines and strips each one.
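As a small sketch of why the pattern from the question falls short: with the re module, a capturing group inside a repeated group keeps only its last iteration, so the named group ends up holding just the final line. The variable names below (attempt, last_only, block, lines) are only illustrative.

```python
import re

text = "Word1 content1 content1 content1\n content2 content2 content2\n content3 content3 content3\nWord2"

# Pattern from the question: the named group sits inside a repeated group,
# so re overwrites it on every iteration and keeps only the last capture.
attempt = re.search(r'Word1[\s:]*((?P<value>[^\n]+)\n)+Word2', text)
last_only = attempt.group('value')

# Capturing the whole block once and splitting it afterwards keeps every line.
block = re.search(r'Word1[\s:]*((?:.+\n)*)Word2', text)
lines = [s.strip() for s in block.group(1).splitlines()]

print(repr(last_only))
print(lines)
```

This is why the answer captures the block as a single group and leaves the per-line split to Python.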
An alternative solution uses the PyPI regex library, which supports the variable-width lookbehind that re rejects:
import regex
text = "Word1 content1 content1 content1\n content2 content2 content2\n content3 content3 content3\nWord2"
print( regex.findall(r'(?<=Word1[\s:]*(?s:.*?))\S(?:.*\S)?(?=(?s:.*?)\nWord2)', text) )
See the Python demo. Details:
(?<=Word1[\s:]*(?s:.*?)) - a positive lookbehind that requires a Word1 string, zero or more whitespaces or colons, and then any zero or more chars, as few as possible, immediately to the left of the current location
\S(?:.*\S)? - a non-whitespace char and then any zero or more chars other than line break chars, as many as possible, up to the last non-whitespace char on the line
(?=(?s:.*?)\nWord2) - a positive lookahead that requires any zero or more chars, as few as possible, and then a newline char and the Word2 word to the right of the current location.
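If installing a third-party package is not an option, the per-line piece of that pattern, \S(?:.*\S)?, can also be applied with plain re in two steps. This is only a sketch of one possible variant, not part of the original answer, and the names block and trimmed are illustrative.

```python
import re

text = "Word1 content1 content1 content1\n content2 content2 content2\n content3 content3 content3\nWord2"

# Step 1: grab everything between Word1 and Word2 as one block.
block = re.search(r'Word1[\s:]*((?:.+\n)*)Word2', text)

# Step 2: on each line, match from the first to the last non-whitespace char,
# which trims leading and trailing whitespace without calling strip().
trimmed = re.findall(r'\S(?:.*\S)?', block.group(1)) if block else []

print(trimmed)
```

Because . does not match line breaks by default, each findall match stays within one line.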