Objective: to extract the first email from an email thread
Description: Based on manual inspection of the emails, I found that each subsequent email in the thread always starts with a block of From, Sent, To and Subject lines.
Test Input:
Hello World from: the other side of the first email
from: this
sent: at
to: that
subject: what
second email
from: this
sent: at
to: that
subject: what
third email
from: this
date: at
to: that
subject: what
fourth email
Expected output:
Hello World from: the other side of the first email
Failed Attempts:
The following breaks when there's a from: in the first email:
(.*)((from:[\s\S]+?)(sent:[\s\S]+?)(to:[\s\S]+?)(subject:[\s\S]+))
The following fails when there are repeated groups of From, Sent, To and Subject:
([\s\S]+)((from:(?:(?!from:)[\s\S])+?sent:(?:(?!sent:)[\s\S])+?to:(?:(?!to:)[\s\S])+?subject:(?:(?!subject:)[\s\S])+))
The second attempt works with PCRE (PHP) when the ungreedy option (flag) is selected. However, this option is not available in Python, and I couldn't figure out a way to make it work there.
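For reference, PCRE's ungreedy flag inverts the greediness of every quantifier, so one way to approximate it in Python is to swap the quantifiers by hand (`+` becomes `+?` and `+?` becomes `+`). The following is a minimal sketch under that assumption; the names `UNGREEDY_EQUIVALENT` and `first_email` are made up for illustration, and the trailing newline left in the first group is stripped afterwards.

```python
import re

# Hand-inverted version of the second attempt: every greedy quantifier is
# made lazy and every lazy one greedy, mimicking PCRE's ungreedy flag.
UNGREEDY_EQUIVALENT = re.compile(
    r"([\s\S]+?)"
    r"((from:(?:(?!from:)[\s\S])+"
    r"sent:(?:(?!sent:)[\s\S])+"
    r"to:(?:(?!to:)[\s\S])+"
    r"subject:(?:(?!subject:)[\s\S])+?))"
)

# The test input from the question, one item per line.
sample = "\n".join([
    "Hello World from: the other side of the first email",
    "from: this", "sent: at", "to: that", "subject: what",
    "second email",
    "from: this", "sent: at", "to: that", "subject: what",
    "third email",
    "from: this", "date: at", "to: that", "subject: what",
    "fourth email",
])


def first_email(thread):
    """Return the text before the first from/sent/to/subject block, or None."""
    match = UNGREEDY_EQUIVALENT.search(thread)
    # Group 1 ends with the newline that precedes the header block.
    return match.group(1).strip() if match else None


print(first_email(sample))
```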
To only get the first match, you could use a capturing group and match exactly what should follow.
^(.*)\r?\n\s*\r?\nfrom:.*\r?\nsent:.*\r?\nto:.*\r?\nsubject:
^
    Start of string
(.*)
    Match any char except a newline 0+ times (capture group 1)
\r?\n\s*
    Match a newline followed by 0+ whitespace chars using \s*
\r?\nfrom:.*
    Match the next line starting with from:
\r?\nsent:.*
    Match the next line starting with sent:
\r?\nto:.*
    Match the next line starting with to:
\r?\nsubject:.*
    Match the next line starting with subject:
Note that in the demo link the global flag g at the top right is not enabled.
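In Python, the pattern above could be applied roughly as follows. This is a sketch, not a definitive implementation: the sample string assumes a blank line separates each message from the next header block (the pattern requires one before from:), and `first_email` is a hypothetical helper name. `re.search` returns only the first match, which corresponds to leaving the g flag off.

```python
import re

FIRST_EMAIL = re.compile(
    r"^(.*)\r?\n\s*\r?\nfrom:.*\r?\nsent:.*\r?\nto:.*\r?\nsubject:"
)

# Sample modelled on the test input; the blank lines are an assumption.
thread = (
    "Hello World from: the other side of the first email\n"
    "\n"
    "from: this\n"
    "sent: at\n"
    "to: that\n"
    "subject: what\n"
    "second email\n"
    "\n"
    "from: this\n"
    "sent: at\n"
    "to: that\n"
    "subject: what\n"
    "third email\n"
)


def first_email(text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    # Without re.MULTILINE, ^ anchors only at the start of the string.
    match = FIRST_EMAIL.search(text)
    return match.group(1) if match else None


print(first_email(thread))
```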
If the first email can span multiple lines, and it is acceptable not to cross any line that starts with from:, sent:, to: or subject:, you could also use a negative lookahead.
^(.*(?:\r?\n(?!(?:from|sent|to|subject):).*)*)\r?\n\s*\r?\nfrom:.*\r?\nsent:.*\r?\nto:.*\r?\nsubject:
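A short Python sketch of the lookahead variant, assuming a first email of several lines followed by a blank line and the header block. The sample text and the name `first_email_multiline` are invented for illustration. A from: in the middle of a line is allowed, because the lookahead is only checked directly after a newline.

```python
import re

MULTILINE_FIRST = re.compile(
    r"^(.*(?:\r?\n(?!(?:from|sent|to|subject):).*)*)"
    r"\r?\n\s*\r?\nfrom:.*\r?\nsent:.*\r?\nto:.*\r?\nsubject:"
)

# Hypothetical thread whose first email spans three lines.
thread = (
    "Hello World\n"
    "this line mentions from: in the middle\n"
    "last line of the first email\n"
    "\n"
    "from: this\n"
    "sent: at\n"
    "to: that\n"
    "subject: what\n"
    "second email\n"
)


def first_email_multiline(text):
    """Return the lines before the first header block, or None."""
    match = MULTILINE_FIRST.search(text)
    return match.group(1) if match else None


print(first_email_multiline(thread))
```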
If there is whitespace between the from, sent, to and subject lines, 0+ whitespace characters can be matched using \s*:
^(.*(?:\r?\n(?!(?:from|sent|to|subject):).*)*)\r?\s*\r?\sfrom:.*\r?\s*sent:.*\r?\s*to:.*\r?\s*subject:
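The whitespace-tolerant pattern can be tried in Python along these lines. This sketch assumes the extra whitespace consists of blank lines between header lines that still start at the beginning of a line; the sample and the name `first_email_spaced` are hypothetical. With this pattern the first group can keep a trailing newline, so the sketch strips it.

```python
import re

SPACED_HEADERS = re.compile(
    r"^(.*(?:\r?\n(?!(?:from|sent|to|subject):).*)*)"
    r"\r?\s*\r?\sfrom:.*\r?\s*sent:.*\r?\s*to:.*\r?\s*subject:"
)

# Hypothetical thread with blank lines between the header lines.
thread = (
    "Hello World from: the other side of the first email\n"
    "\n"
    "from: this\n"
    "\n"
    "sent: at\n"
    "\n"
    "to: that\n"
    "\n"
    "subject: what\n"
    "second email\n"
    "\n"
    "from: this\n"
    "\n"
    "sent: at\n"
    "\n"
    "to: that\n"
    "\n"
    "subject: what\n"
    "third email\n"
)


def first_email_spaced(text):
    """Return the first email with trailing whitespace removed, or None."""
    match = SPACED_HEADERS.search(text)
    # Group 1 may end with the newline of the blank separator line.
    return match.group(1).rstrip() if match else None


print(first_email_spaced(thread))
```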