Parsing email responses using Regex


I am trying to use the solution from the question "Parse email content from quoted reply" to parse email responses programmatically.

It works fine in most cases, except for replies from Gmail and Outlook, where the sender line is left in the output:
On Sun, Mar 31, 2013 at 10:57 AM, < [email protected]> wrote:

I don't know regex well, but I expected the following patterns to handle it:

new Regex("From:\\s*" + Regex.Escape(address), RegexOptions.IgnoreCase)
new Regex("\\n.*On.*(\\r\\n)?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline)

Sample Data:
Do read it.\r\n\r\n\r\nOn Sun, Mar 31, 2013 at 10:57 AM, <\r\n [email protected] > wrote:\r\n\r\n>

Expected Outcome:
Do read it.

Current Outcome:
Do read it. On Sun, Mar 31, 2013 at 10:57 AM, wrote:


Solution

  • Use a capturing group to get a part of this match:

    new Regex("\\n(.*)[\\r\\n]*On(?:.|\\r|\\n)*?wrote:\\r\\n", RegexOptions.IgnoreCase | RegexOptions.Multiline)
    

    Also, prefer lazy quantifiers over greedy ones: .* => .*?
    The provided link explains why.

    Edit: As my comment points out, a dot does not match \n unless RegexOptions.Singleline is set (in .NET it does match \r), so .* alone cannot cross the line break inside the sender line. The comment also says that recommending lazy quantifiers here was pointless, but I'll leave the advice in because it is still worth knowing.

    Edit 2: In fact, the lazy quantifier does matter for the second part of the regex (the span between On and wrote:). Edited.
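
For illustration, here is a minimal sketch of how the quoted part could be cut off once the header is found. It does not use the exact pattern above. It is a variant that assumes the header starts a line with "On" and ends a line with "wrote:". The names ReplyTrimmer and StripQuotedReply are made up for this example. [\s\S]*? plays the same role as (?:.|\r|\n)*? and lets the match cross the line break inside the sender line.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the linked solution.
public static class ReplyTrimmer
{
    // Assumption: the quote header begins a line with "On" and ends a line with "wrote:".
    // [\s\S]*? lazily matches any characters, including \r and \n, so a header
    // that is wrapped over several lines is still matched as one unit.
    private static readonly Regex QuoteHeader = new Regex(
        @"^On\b[\s\S]*?wrote:[ \t]*\r?$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Returns the text before the first quote header, or the body unchanged
    // when no header is found.
    public static string StripQuotedReply(string body)
    {
        Match m = QuoteHeader.Match(body);
        return m.Success ? body.Substring(0, m.Index).TrimEnd() : body;
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample data from the question.
        string sample = "Do read it.\r\n\r\n\r\nOn Sun, Mar 31, 2013 at 10:57 AM, <\r\n [email protected] > wrote:\r\n\r\n>";
        Console.WriteLine(ReplyTrimmer.StripQuotedReply(sample)); // prints: Do read it.
    }
}
```

Because the span between "On" and "wrote:" is unbounded, a body line that happens to start with "On" could cause an early cut. Bounding the span, for example with [\s\S]{0,200}?, is one way to reduce that risk.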