I have a large file filled with content in the following pattern, and I would like to split it into email blocks. A block is everything from a Subject: line up to the next occurrence of Subject:, but only if that next Subject: is followed by From:.
Subject: Hello
From: John Doe
Date: Fri, 12 Feb 2010 09:13:51 +0200
Lorem ipsum...
Subject: How are you.
I am fine
Subject: Howdy
From: Jane Doe
Date: Fri, 12 Feb 2010 09:58:14 +0200
Lorem ipsum...
Subject: Re: Howdy
From: Eminem
In the example above, the first email block would be:
Subject: Hello
From: John Doe
Date: Fri, 12 Feb 2010 09:13:51 +0200
Lorem ipsum...
Subject: How are you.
I am fine
Second email block:
Subject: Howdy
From: Jane Doe
Date: Fri, 12 Feb 2010 09:58:14 +0200
Lorem ipsum...
I have tried the following method but it doesn't work for all the cases.
email_blocks = re.split(r'\n(?=Subject:)', email_data)
It incorrectly splits the first block into two separate blocks because it only looks for the keyword Subject:. What I need is a way to split from Subject: to the second Subject: only if it is followed by From:.
I have also tried the following but it didn't create an array of blocks. It only returned the last block:
email_blocks = re.findall(r'Subject:.*?(?=Subject:|\nFrom:|$)', email_data, re.DOTALL)
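email_blocks = re.split(r'\n(?=Subject:.*\nFrom:)', email_data)
This splits on a newline, using a lookahead for "Subject:", followed by the rest of that line, followed by a newline and then "From:". Because . does not match a newline by default, the From: line has to come directly after the Subject: line for the split to happen.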
# Output:
['Subject: Hello\nFrom: John Doe\nDate: Fri, 12 Feb 2010 09:13:51 +0200\nLorem ipsum...\nSubject: How are you.\nI am fine',
'Subject: Howdy\nFrom: Jane Doe\nDate: Fri, 12 Feb 2010 09:58:14 +0200\nLorem ipsum...',
'Subject: Re: Howdy\nFrom: Eminem']
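For anyone who wants to try this end to end, here is a minimal runnable sketch using the sample data from the question. The constant name BLOCK_BOUNDARY and the inline sample string are just illustrative choices, and the sketch assumes the whole file fits in memory and uses \n line endings.

```python
import re

# Sample data copied from the question (assumed to use "\n" line endings).
email_data = (
    "Subject: Hello\n"
    "From: John Doe\n"
    "Date: Fri, 12 Feb 2010 09:13:51 +0200\n"
    "Lorem ipsum...\n"
    "Subject: How are you.\n"
    "I am fine\n"
    "Subject: Howdy\n"
    "From: Jane Doe\n"
    "Date: Fri, 12 Feb 2010 09:58:14 +0200\n"
    "Lorem ipsum...\n"
    "Subject: Re: Howdy\n"
    "From: Eminem"
)

# Split on a newline only when the next line starts with "Subject:" and the
# line after that starts with "From:". The lookahead consumes nothing, so
# each block keeps its own "Subject:" header.
BLOCK_BOUNDARY = re.compile(r'\n(?=Subject:.*\nFrom:)')

email_blocks = BLOCK_BOUNDARY.split(email_data)

for number, block in enumerate(email_blocks, 1):
    print(f"--- block {number} ---")
    print(block)
```

The "Subject: How are you." line stays inside the first block because the line after it is not a From: line, so the lookahead fails at that newline.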