I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is (\n is a newline):
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the some Varying TEXT part, and all the lines of uppercase text below it, in one capture.
I've tried a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
...and a lot of variations thereof, with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some Varying TEXT and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
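Here is a small sketch of how that pattern might be used. The sample text and the names text, pattern and records are made up for illustration; only the regex itself comes from the line above. Group 2 still contains the newlines between the sequence lines, so they are stripped afterwards.

```python
import re

# Hypothetical sample data laid out as described in the question:
# a header line, a blank line, sequence lines, then a blank line.
text = (
    "first protein\n"
    "\n"
    "DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n"
    "QWERTYUIOP\n"
    "\n"
    "second protein\n"
    "\n"
    "AAAAGGGG\n"
    "CCCC\n"
    "TTTT\n"
    "\n"
)

pattern = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

records = []
for match in pattern.finditer(text):
    header = match.group(1)
    # group(2) holds "\nLINE1\nLINE2..."; drop all whitespace to join the lines.
    sequence = "".join(match.group(2).split())
    records.append((header, sequence))

for header, sequence in records:
    print(header, sequence)
```

Using finditer rather than search picks up every record in the text, since the question mentions a few hundred of them.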
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
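A tiny experiment shows the difference. The two-line string here is just an illustrative stand-in: because the anchors are zero-width, a pattern that puts $ directly next to ^ has nothing to consume the newline between the lines, so the newline has to be matched explicitly.

```python
import re

text = "ab\ncd"

# $ and ^ are zero-width: neither one consumes the newline between the lines,
# so this pattern cannot get from the end of "ab" to the start of "cd".
no_match = re.search(r"ab$^cd", text, re.MULTILINE)

# Matching the newline explicitly between the anchors makes the pattern work.
match = re.search(r"ab$\n^cd", text, re.MULTILINE)

print(no_match)                # None
print(repr(match.group(0)))    # 'ab\ncd'
```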
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
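One thing to watch in Python specifically: without DOTALL the dot excludes only \n, so it can still match a \r, and ^ in multiline mode only treats \n as a line break. On \r\n input that can let a match run across the blank separator lines. A simple way to sidestep both issues, sketched here under the assumption that the whole text fits in memory, is to normalize the line endings first and then use the linefeed-only pattern. The helper name normalize_newlines and the sample string are made up for this example.

```python
import re

def normalize_newlines(text):
    # Convert CRLF first, then any remaining bare CR, to a plain linefeed.
    return text.replace("\r\n", "\n").replace("\r", "\n")

pattern = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

# Hypothetical input with Windows-style line endings.
windows_text = (
    "some header\r\n"
    "\r\n"
    "ABCDEF\r\n"
    "GHIJKL\r\n"
    "\r\n"
    "next header\r\n"
    "\r\n"
    "MNOP\r\n"
    "\r\n"
)

matches = pattern.findall(normalize_newlines(windows_text))
print(matches)
```

If the text comes from a file, opening it in text mode with Python's default newline handling should perform the same translation on read, so the explicit helper is only needed for strings obtained some other way.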
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
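To see why, compare the same pattern with and without the flag on a small made-up sample. With DOTALL the dot is free to cross newlines, so the first group greedily swallows several records instead of stopping at the end of the header line.

```python
import re

# Hypothetical two-record sample in the layout from the question.
text = "header one\n\nAAAA\nBBBB\n\nheader two\n\nCCCC\n\n"
regex = r"^(.+)\n((?:\n.+)+)"

# Dot stops at each newline, so every record is matched separately.
without_dotall = re.findall(regex, text, re.MULTILINE)

# Dot also matches newlines, so one greedy match spans both records.
with_dotall = re.findall(regex, text, re.MULTILINE | re.DOTALL)

print(len(without_dotall))   # 2
print(len(with_dotall))      # 1
```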