c# regex email-parsing

Parse email header with Regex in C#


I've got a webhook posting to a form on my web application and I need to parse out the email header addresses.

Here is the source text:

Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: "Lastname, Firstname" <[email protected]>
To: <[email protected]>, [email protected], [email protected]
Cc: <[email protected]>, [email protected]
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]

I'm looking to pull out the following:

<[email protected]>, [email protected], [email protected]

I've been struggling with Regex all day without any luck.


Solution

  • Contrary to some of the posts here, I have to agree with mmutz: you cannot reliably parse email addresses with a regex. See RFC 2822, section 3.4.1:

    https://www.rfc-editor.org/rfc/rfc2822#section-3.4.1

    3.4.1. Addr-spec specification

    An addr-spec is a specific Internet identifier that contains a locally interpreted string followed by the at-sign character ("@", ASCII value 64) followed by an Internet domain.

    The idea of "locally interpreted" means that only the receiving server is expected to be able to parse it.
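
    To make the difficulty concrete, here is a minimal sketch of what goes wrong with the simplest pattern-based approach, splitting on every comma. The class and method names (NaiveSplitDemo, SplitOnCommas) and the address user@example.com are placeholders invented for this illustration, not part of the original headers.

    ```csharp
    using System;

    static class NaiveSplitDemo
    {
        // Hypothetical helper: splits a header value on every comma, ignoring quoting.
        public static string[] SplitOnCommas(string headerValue)
        {
            return headerValue.Split(',');
        }

        static void Main()
        {
            // Placeholder mailbox; the comma inside the quoted display name is legal.
            string from = "\"Lastname, Firstname\" <user@example.com>";

            foreach (string piece in SplitOnCommas(from))
            {
                Console.WriteLine("[" + piece.Trim() + "]");
            }
            // Two pieces are printed even though the header holds a single mailbox.
        }
    }
    ```

    Any approach that treats the comma as an unconditional separator has this problem, which is why the segments need to be validated by a real address parser.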

    If I were going to try to solve this, I would find the contents of the "To" line, break it apart at the commas, and attempt to parse each segment with System.Net.Mail.MailAddress.

        // Requires: using System; using System.Text.RegularExpressions;
        static void Main()
        {
            string input = @"Thread-Topic: test subject
    Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
    From: ""Lastname, Firstname"" <[email protected]>
    To: <[email protected]>, ""Yes, this is valid""@[emails are hard to parse!], [email protected], [email protected]
    Cc: <[email protected]>, [email protected]
    X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]";
    
            Regex toline = new Regex(@"(?im-:^To\s*:\s*(?<to>.*)$)");
            string to = toline.Match(input).Groups["to"].Value;
    
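            int from = 0;   // index where the search for the next comma starts
            int pos = 0;    // start of the segment currently being tested
            int found;
            string test;
            
            while(from < to.Length)
            {
                // Use the next comma as the segment end, or the end of the string if none is left.
                // IndexOf returns -1 when nothing is found, so the test must be >= 0, not > 0.
                found = (found = to.IndexOf(',', from)) >= 0 ? found : to.Length;
                from = found + 1;
                test = to.Substring(pos, found - pos);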
    
                try
                {
                    System.Net.Mail.MailAddress addy = new System.Net.Mail.MailAddress(test.Trim());
                    Console.WriteLine(addy.Address);
                    pos = found + 1;
                }
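                catch (FormatException)
                {
                    // Not a valid address yet (the comma may be inside a quoted string);
                    // leave pos unchanged so the next pass extends the segment to the next comma.
                }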
            }
        }
    

    Output from the above program:

    [email protected]
    "Yes, this is valid"@[emails are hard to parse!]
    [email protected]
    [email protected]
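
    Since the headers in the question also carry a Cc line, the same idea can be wrapped in a helper that takes the header name as a parameter. This is only a sketch under a few assumptions: the names HeaderAddresses and Extract are invented here, the addresses are placeholders, and folded (multi-line) header values are not handled.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Net.Mail;
    using System.Text.RegularExpressions;

    static class HeaderAddresses
    {
        // Hypothetical helper: returns the addresses found on the named header line.
        public static List<string> Extract(string headers, string headerName)
        {
            // Match "<name>: value" at the start of a line, case-insensitively.
            // [ \t]* and [^\r\n]* keep the match from running onto the next line.
            var line = new Regex(
                "^" + Regex.Escape(headerName) + @"[ \t]*:[ \t]*(?<value>[^\r\n]*)",
                RegexOptions.IgnoreCase | RegexOptions.Multiline);
            string value = line.Match(headers).Groups["value"].Value;

            var result = new List<string>();
            int start = 0;   // start of the segment currently being tested
            int from = 0;    // where the search for the next comma begins

            while (from < value.Length)
            {
                int comma = value.IndexOf(',', from);
                int end = comma >= 0 ? comma : value.Length;
                from = end + 1;

                string candidate = value.Substring(start, end - start).Trim();
                if (candidate.Length == 0)
                {
                    start = end + 1;   // skip empty segments such as ",,"
                    continue;
                }

                try
                {
                    result.Add(new MailAddress(candidate).Address);
                    start = end + 1;   // parsed: the next segment starts after this comma
                }
                catch (FormatException)
                {
                    // Not a complete address yet (the comma may sit inside a quoted
                    // string), so keep 'start' and extend to the next comma.
                }
            }
            return result;
        }

        static void Main()
        {
            // Placeholder headers for illustration only.
            string headers = "To: <a@example.com>, b@example.com\nCc: c@example.com";

            foreach (string address in Extract(headers, "To"))
                Console.WriteLine(address);
            foreach (string address in Extract(headers, "Cc"))
                Console.WriteLine(address);
        }
    }
    ```

    Segments that never parse are dropped silently, exactly as in the program above; logging them in the catch block would be a reasonable variation.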