python regex string-matching

Regex: Select the First Closest Exact Match Until the End


Objective: to extract the first email from an email thread

Description: Based on manual inspection of the emails, I realized that each subsequent email in the thread always starts with a block of From, Sent, To and Subject lines.

Test Input:

Hello World from: the other side of the first email

from: this
sent: at
to: that
subject: what

second email

from: this
sent: at
to: that
subject: what


third email

from: this
date: at
to: that
subject: what

fourth email

Expected output:

Hello World from: the other side of the first email

Failed Attempts:

The following breaks when the first email itself contains from:

(.*)((from:[\s\S]+?)(sent:[\s\S]+?)(to:[\s\S]+?)(subject:[\s\S]+))

The following fails when there are repeated groups of From, Sent, To and Subject

([\s\S]+)((from:(?:(?!from:)[\s\S])+?sent:(?:(?!sent:)[\s\S])+?to:(?:(?!to:)[\s\S])+?subject:(?:(?!subject:)[\s\S])+))

The second attempt works with PCRE (PHP) when the ungreedy option (flag) is selected. However, this option is not available in Python, and I couldn't figure out a way to make it work.

Regex101 demo


Solution

  • To only get the first match, you could use a capturing group and match exactly what should follow.

    ^(.*)\r?\n\s*\r?\nfrom:.*\r?\nsent:.*\r?\nto:.*\r?\nsubject:
    
    • ^ Start of string
    • (.*) Capture group 1, match any char except a newline 0+ times
    • \r?\n\s* Match a newline followed by 0+ whitespace chars using \s*
    • \r?\nfrom:.* Match the next line starting with from:
    • \r?\nsent:.* Match the next line starting with sent:
    • \r?\nto:.* Match the next line starting with to:
    • \r?\nsubject: Match the start of the next line, which must begin with subject:

    Note that in the demo link the global flag g at the top right is not enabled.

    Regex demo | Python demo
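
As a rough sketch of how the pattern above might be applied in Python, the snippet below runs it against the test input from the question. The names FIRST_EMAIL, thread and first_email are illustrative only and not part of the original answer; re.match is used because the pattern is anchored at the start of the string.

```python
import re

# Pattern from the answer: capture the first line, then require a blank
# line followed by the from:/sent:/to:/subject: header block.
FIRST_EMAIL = re.compile(
    r"^(.*)\r?\n\s*\r?\nfrom:.*\r?\nsent:.*\r?\nto:.*\r?\nsubject:"
)

# Shortened copy of the test input from the question.
thread = (
    "Hello World from: the other side of the first email\n"
    "\n"
    "from: this\n"
    "sent: at\n"
    "to: that\n"
    "subject: what\n"
    "\n"
    "second email\n"
    "\n"
    "from: this\n"
    "sent: at\n"
    "to: that\n"
    "subject: what\n"
)


def first_email(text):
    """Return the first email (group 1), or None if no header block follows."""
    match = FIRST_EMAIL.match(text)
    return match.group(1) if match else None


print(first_email(thread))
```

If the real headers are capitalised (From:, Sent:, ...), passing re.IGNORECASE to re.compile is one possible adjustment; that is an assumption about the data, not something the test input requires.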

    If the first email can span multiple lines, and it is acceptable that none of those lines starts with from:, sent:, to: or subject:, you could also use a negative lookahead.

    ^(.*(?:\r?\n(?!(?:from|sent|to|subject):).*)*)\r?\n\s*\r?\nfrom:.*\r?\nsent:.*\r?\nto:.*\r?\nsubject:
    

    Regex demo
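
A minimal sketch of the negative-lookahead variant in Python, assuming a first email that spans two lines. The sample text and the names MULTILINE_FIRST_EMAIL, sample_thread and first_email_multiline are made up for illustration.

```python
import re

# Group 1 takes the first line plus any following lines that do not start
# with from:, sent:, to: or subject:. The header block must come after it.
MULTILINE_FIRST_EMAIL = re.compile(
    r"^(.*(?:\r?\n(?!(?:from|sent|to|subject):).*)*)"
    r"\r?\n\s*\r?\nfrom:.*\r?\nsent:.*\r?\nto:.*\r?\nsubject:"
)

# Hypothetical input: the first email has two lines and contains "from:"
# in the middle of a line, which the lookahead does not reject.
sample_thread = (
    "Hello World from: the other side\n"
    "second line of the first email\n"
    "\n"
    "from: this\n"
    "sent: at\n"
    "to: that\n"
    "subject: what\n"
    "\n"
    "second email\n"
)


def first_email_multiline(text):
    """Return the possibly multi-line first email, or None if nothing matches."""
    match = MULTILINE_FIRST_EMAIL.match(text)
    return match.group(1) if match else None


print(first_email_multiline(sample_thread))
```

The lookahead is only checked directly after a newline, so a from: in the middle of a line stays part of the first email.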

    If there can be extra whitespace (including blank lines) between the from:, sent:, to: and subject: lines, you can match 0+ whitespace characters with \s* instead of a single newline.

    ^(.*(?:\r?\n(?!(?:from|sent|to|subject):).*)*)\r?\s*\r?\sfrom:.*\r?\s*sent:.*\r?\s*to:.*\r?\s*subject:
    

    Regex demo
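
The whitespace-tolerant pattern can be tried the same way. This is only a sketch: the input with blank lines between the header lines is invented, and the names LOOSE_FIRST_EMAIL, spaced_thread and first_email_loose are hypothetical. With this pattern group 1 can end with a trailing newline, so the sketch strips it.

```python
import re

# Same capture group as before, but \s* allows blank lines or other
# whitespace between the header lines.
LOOSE_FIRST_EMAIL = re.compile(
    r"^(.*(?:\r?\n(?!(?:from|sent|to|subject):).*)*)"
    r"\r?\s*\r?\sfrom:.*\r?\s*sent:.*\r?\s*to:.*\r?\s*subject:"
)

# Hypothetical input with a blank line between every header line.
spaced_thread = (
    "Hello World\n"
    "\n"
    "from: this\n"
    "\n"
    "sent: at\n"
    "\n"
    "to: that\n"
    "\n"
    "subject: what\n"
    "\n"
    "second email\n"
)


def first_email_loose(text):
    """Return the first email with surrounding whitespace removed, or None."""
    match = LOOSE_FIRST_EMAIL.match(text)
    return match.group(1).strip() if match else None


print(first_email_loose(spaced_thread))
```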