Tags: python, regex, file, python-re

Split Large File Into Blocks Based on Regex Criteria


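I have a large file filled with content in the following pattern, and I would like to split it into email blocks. A block starts at a Subject: line that is immediately followed by a From: line, and it runs up to (but not including) the next Subject: line that is also followed by From:. A Subject: line without a From: line after it belongs to the current block.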

Subject: Hello
From: John Doe
Date: Fri, 12 Feb 2010 09:13:51 +0200
Lorem ipsum...

Subject: How are you.
I am fine

Subject: Howdy
From: Jane Doe
Date: Fri, 12 Feb 2010 09:58:14 +0200
Lorem ipsum...

Subject: Re: Howdy
From: Eminem

In the example above, the first email block would be:

Subject: Hello
From: John Doe
Date: Fri, 12 Feb 2010 09:13:51 +0200
Lorem ipsum...

Subject: How are you.
I am fine

Second email block:

Subject: Howdy
From: Jane Doe
Date: Fri, 12 Feb 2010 09:58:14 +0200
Lorem ipsum...

I have tried the following, but it does not work in all cases:

email_blocks = re.split(r'\n(?=Subject:)', email_data)

It incorrectly splits the first block into two separate blocks because the lookahead only checks for the keyword Subject:. What I need is to split before a Subject: line only when that line is followed by a From: line.

I have also tried the following but it didn't create an array of blocks. It only returned the last block:

email_blocks = re.findall(r'Subject:.*?(?=Subject:|\nFrom:|$)', email_data, re.DOTALL)


Solution

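  • re.split(r'\n(?=Subject:.*\nFrom:)', email_data) splits on a newline only when the lookahead finds "Subject:", followed by the rest of that line, followed by a newline and then "From:". Because . does not match a newline by default, .* cannot run past the end of the Subject: line, so the From: must be on the very next line. The lookahead consumes nothing, so each block keeps its Subject: line.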

    # Output (for sample data with no blank lines between entries):
    ['Subject: Hello\nFrom: John Doe\nDate: Fri, 12 Feb 2010 09:13:51 +0200\nLorem ipsum...\nSubject: How are you.\nI am fine', 
     'Subject: Howdy\nFrom: Jane Doe\nDate: Fri, 12 Feb 2010 09:58:14 +0200\nLorem ipsum...', 
     'Subject: Re: Howdy\nFrom: Eminem']
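
For reference, here is a minimal runnable sketch of the split described above. The sample string copies the question's data; the blank lines between entries, the BLOCK_START name and the split_email_blocks helper are assumptions made for illustration. Each block is stripped so that blank separator lines do not end up at the edges of a block.

```python
import re

# Sample data copied from the question. The blank lines between
# entries are an assumption about how the real file is laid out.
email_data = """Subject: Hello
From: John Doe
Date: Fri, 12 Feb 2010 09:13:51 +0200
Lorem ipsum...

Subject: How are you.
I am fine

Subject: Howdy
From: Jane Doe
Date: Fri, 12 Feb 2010 09:58:14 +0200
Lorem ipsum...

Subject: Re: Howdy
From: Eminem
"""

# Split at a newline only when the next line starts with "Subject:"
# and the line right after it starts with "From:". The lookahead
# consumes nothing, so the "Subject:" line stays in its block.
BLOCK_START = re.compile(r'\n(?=Subject:.*\nFrom:)')


def split_email_blocks(text):
    """Hypothetical helper: return the blocks, stripped, skipping empty ones."""
    return [block.strip() for block in BLOCK_START.split(text) if block.strip()]


email_blocks = split_email_blocks(email_data)
for block in email_blocks:
    print(repr(block))
```

The "Subject: How are you." line stays inside the first block because the line after it is not a From: line. Note that this reads the whole file into memory as one string; for a very large file a line-by-line approach may be a better fit.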