javascript regex email mime

Parse text/html part of email source using Javascript


Using JavaScript, I need to parse the Content-Type text/html portion of an email message and extract just the HTML part. Here's an example of the relevant part of the mail source:

------=_Part_1504541_510475628.1327512846983
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit


<html ... a bunch of html ...

</html>

I want to extract everything between (and including) the <html> tags after text/html. How do I do this?

NOTE: I'm OK with a hacky regex. I don't expect this to be bulletproof.


Solution

  • Based on RFC/MIME documentation, the encapsulation boundary is defined as a line consisting entirely of two hyphen characters ("-", decimal code 45) followed by the boundary parameter value from the Content-Type header field.

    Note: Before ES2018, JavaScript had no /s (dotAll) flag to make the dot . match every character, including line breaks. To match absolutely any character in older engines, use a character class that combines a shorthand class with its negated version, such as [\s\S].


    Regex:

    \n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--
    

    JavaScript:

    const matches = /\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--/i.exec(mail);
    const html = matches ? matches[1] : null; // capture group 1 holds the HTML body
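
To see the regex above at work end to end, here is a minimal sketch that wraps it in a helper and runs it on a small made-up message. The function name `extractHtmlPart` and the `sample` message (with its boundary `b1`) are hypothetical and exist only for illustration. The sketch assumes the text/html part is preceded by a boundary line and followed by a blank line and another boundary, as in the question's example. It does not decode quoted-printable or base64 bodies.

```javascript
// Hypothetical helper: returns the body of the first text/html MIME part,
// or null when the pattern does not match.
function extractHtmlPart(mail) {
  // Same pattern as above; [\s\S] stands in for "any character, including newlines".
  const pattern = /\n--[^\n\r]*\r?\nContent-Type: text\/html[\s\S]*?\r?\n\r?\n([\s\S]*?)\n\r?\n--/i;
  const match = pattern.exec(mail);
  // With CRLF line endings the capture can end in a stray "\r", so trim it.
  return match ? match[1].trim() : null;
}

// Made-up two-part message joined with CRLF, as mail sources usually are.
const sample = [
  'Content-Type: multipart/alternative; boundary="b1"',
  "",
  "--b1",
  "Content-Type: text/plain; charset=UTF-8",
  "",
  "Hello",
  "",
  "--b1",
  "Content-Type: text/html; charset=UTF-8",
  "Content-Transfer-Encoding: 7bit",
  "",
  "<html><body><p>Hello</p></body></html>",
  "",
  "--b1--",
].join("\r\n");

console.log(extractHtmlPart(sample));
// prints: <html><body><p>Hello</p></body></html>
```

Because the pattern requires an empty line before the closing boundary, a message whose HTML body runs straight into the next boundary line will not match, and the helper returns null. For anything beyond a quick hack, a real MIME parser is the safer choice.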