Tags: c#, regex, windows-runtime, mime

Regex pattern to parse MIME sections in C# / WinRT


Going batshit crazy trying to make this work this morning.

I'm using the boundary string of an email to try and split it into text/plain and text/html parts. I know there are libraries out there to do this but none of them work in WinRT.

Here's what I have. I suck at regex, so it's probably all sorts of wrong:

Sample Data

From: Rory <me@gmail.ftw>
Date: Mon, 8 Oct 2012 17:05:48 +0100
Message-ID: <a1b2c3d4e5f6g7h8i9j10a1b2c3d4e5f6g7h8i9j10@mail.gmail.ftw>
Subject: Subject of my email
To: me@gmail.ftw

Content-Type: multipart/alternative; boundary=bcaec54fbd3a824f3504cb8e677d

--bcaec54fbd3a824f3504cb8e677d

Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

text part of email

--bcaec54fbd3a824f3504cb8e677d
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<html>
    <strong>HTML part of email</strong>
</html>

--bcaec54fbd3a824f3504cb8e677d--

I'm trying to extract

  1. Both sections between the --bcaec54fbd3a824f3504cb8e677d boundary markers
  2. The Content-Type, charset and Content-Transfer-Encoding of each of those sections
  3. The content itself (below the Content-Transfer-Encoding, until the next boundary)

Regex Code

string b = "bcaec54fbd3a824f3504cb8e677d";
Regex r = new Regex(
"(--" + b + "\r?\nContent-Type: (text/plain|text/html); charset=(.+?)\r?\nContent-Transfer-Encoding: (.+?)\r?\n(.*?--" + b + "))", 
RegexOptions.Singleline); 

This matches both parts only if I leave out the boundary string at the end. If I include it, it only matches the first part. Can someone please help me before I start smashing things?

UPDATE: Added Sample data, reduced


Solution

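  • Use the regex pattern

    @"(--" + b + @"(?:\r?\n)+Content-Type:\s+([^;]+);\s+charset=([^\s\n\r]+)(?:\r?\n)+Content-Transfer-Encoding:\s([^\s\n\r]+)(?:\r?\n){2,}.*?)(?=\r?\n--" + b + @"(?:--)?\r?\n)"
    

    with the RegexOptions.Singleline option. The pieces are verbatim strings (@"...") because a regular C# string literal rejects \s as an unrecognized escape sequence. The closing boundary sits inside a lookahead, (?=...), so it is checked but not consumed. In the pattern from the question the trailing "--" + b was part of the match, so the boundary that opens the second section had already been used up by the first match and the second section could never match.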
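
For reference, here is a minimal sketch of how a pattern like this could be run and its groups read back. The names MimePart, MimeSplitter and Split are made up for the example and are not from any library. It also differs from the pattern above in three ways that are my own choices: named groups instead of numbered ones, the boundary passed through Regex.Escape, and a lookahead that also accepts the end of the input after the closing boundary.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed section; not part of any library.
public sealed class MimePart
{
    public string ContentType { get; set; }
    public string Charset { get; set; }
    public string TransferEncoding { get; set; }
    public string Body { get; set; }
}

public static class MimeSplitter
{
    // Splits a multipart message into its sections using the boundary string.
    public static List<MimePart> Split(string message, string boundary)
    {
        // Escape the boundary in case it contains regex metacharacters.
        string b = Regex.Escape(boundary);

        // Verbatim strings keep \s, \r and \n as regex escapes.
        string pattern =
            @"--" + b +
            @"(?:\r?\n)+Content-Type:\s+(?<type>[^;]+);\s+charset=(?<charset>\S+)" +
            @"(?:\r?\n)+Content-Transfer-Encoding:\s(?<encoding>\S+)" +
            @"(?:\r?\n){2,}(?<body>.*?)" +
            // Lookahead: the next boundary must follow, but is not consumed,
            // so it is still available to start the next match.
            @"(?=\r?\n--" + b + @"(?:--)?(?:\r?\n|$))";

        var parts = new List<MimePart>();
        foreach (Match m in Regex.Matches(message, pattern, RegexOptions.Singleline))
        {
            parts.Add(new MimePart
            {
                ContentType = m.Groups["type"].Value,
                Charset = m.Groups["charset"].Value,
                TransferEncoding = m.Groups["encoding"].Value,
                // Drop the blank line left in front of the next boundary.
                Body = m.Groups["body"].Value.Trim()
            });
        }
        return parts;
    }
}
```

Calling MimeSplitter.Split(rawMessage, "bcaec54fbd3a824f3504cb8e677d") on the sample data should give two parts, text/plain and text/html. The bodies are still quoted-printable encoded at this point, so decoding them is a separate step.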