
Parsing MIME email: Outlook problems and differences


I am learning a Haskell parsing library called Parsec, and as an exercise I need to parse an email message. I've been studying the specs, comparing messages from different clients, reading some RFCs, etc.

For this exercise all I need is to extract the "From:" header and the actual plain text body. All clients seem to produce sane, or at least non-deviating, messages with regard to the specs. The only exception is Outlook (I am not surprised, for some reason).

So the standard way, according to my reading, is to have a boundary sequence, say:

Content-Type: multipart/alternative; boundary=047d7b2e4e3cdc627304eb094bfe

and then all the parts of the multipart body are delimited by this boundary sequence, right? Please, correct me if I am wrong. I'd like my parser to work with all possible clients.

So the common pattern is

--boundary
headers
part

--boundary
headers
part

...

Now, looking at an Outlook-generated message, I see a different picture. It uses some kind of sub-boundaries, and I don't understand whether that is standard or not. This is Outlook's variant:

Content-Type: multipart/related;
    type="multipart/alternative";
    boundary="----_=_NextPart_001_01CEE199.851D3871"

Then the body is delimited like this

------_=_NextPart_001_01CEE199.851D3871
Content-Type: multipart/alternative;
    boundary="----_=_NextPart_002_01CEE199.851D3871"

----_=_NextPart_002_01CEE199.851D3871
headers
body part

----_=_NextPart_002_01CEE199.851D3871
headers
body part

------_=_NextPart_001_01CEE199.851D3871

So it has an outer boundary with sequence 001 and then an inner boundary with sequence 002. So what is this? Is this some kind of Microsoft-specific MIME specification, or is it in an RFC that I missed? This is more complex to parse.


Solution

  • It's not really a sub-boundary; rather, a multipart section can itself contain multipart content. This nesting is standard MIME (RFC 2046), not a Microsoft extension.

    This means that you'll have to parse the boundaries recursively: if a part's content type is multipart/alternative (or any other multipart type), it will contain its own boundary string and parts. The fact that this string is very similar to the outer boundary is just Outlook's doing. It could have been completely different.

    Both

    --part
    --part
    --part
    

    and

    --part
      --part
      --part
    --part
    --part
    

    are valid structures.

    It might be more obvious if Outlook made it look like this:

    Content-Type: multipart/alternative;
        boundary="firstmessage"
    
    --firstmessage
    content-type: multipart/alternative;
        boundary="nestedpart"
    
    --nestedpart
    content-type: text/plain
    
    nested body one
    
    --nestedpart
    content-type: text/plain
    
    nested body two
    
    --nestedpart--
    --firstmessage
    headers
    
    second part of first message
    --firstmessage--
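
The recursive approach described above can be sketched in Haskell roughly as follows. This is only a minimal illustration, not a complete MIME parser: it uses plain list functions from base rather than Parsec, every name in it (Entity, parseEntity, splitParts, firstPlain and so on) is made up for this example, and it assumes the boundary contains no ';', ignores transfer encodings and charsets, and treats a missing Content-Type as text/plain.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (dropWhileEnd, isPrefixOf)
import Data.Maybe (fromMaybe, listToMaybe, mapMaybe)

-- A MIME entity: a leaf body, or a multipart container holding child entities.
data Entity
  = Leaf String [String]   -- media type, body lines
  | Multi String [Entity]  -- media type, child parts
  deriving (Show, Eq)

-- Split raw text into lines, dropping the CR of CRLF line endings.
messageLines :: String -> [String]
messageLines = map (dropWhileEnd (== '\r')) . lines

trim :: String -> String
trim = f . f where f = reverse . dropWhile isSpace

-- Join folded header lines (continuation lines start with whitespace).
unfold :: [String] -> [String]
unfold = foldr step []
  where
    step l (next : rest)
      | continues next = (l ++ " " ++ dropWhile isSpace next) : rest
    step l acc = l : acc
    continues (c : _) = isSpace c
    continues []      = False

-- Case-insensitive lookup of a header value in unfolded header lines.
headerValue :: String -> [String] -> Maybe String
headerValue name hdrs = listToMaybe
  [ trim (drop 1 v)
  | h <- hdrs
  , let (k, v) = break (== ':') h
  , map toLower (trim k) == map toLower name
  , not (null v) ]

splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitOn c rest

-- Extract the boundary parameter from a Content-Type value.
-- Simplification (assumption): the boundary itself contains no ';'.
boundaryOf :: String -> Maybe String
boundaryOf ct = listToMaybe
  [ filter (/= '"') (drop (length key) p)
  | p <- map trim (splitOn ';' ct)
  , key `isPrefixOf` map toLower p ]
  where key = "boundary="

-- Line groups between "--boundary" lines; the preamble and anything after
-- the closing "--boundary--" line are ignored. Lines must match exactly,
-- so an inner boundary is never mistaken for the outer one.
splitParts :: String -> [String] -> [[String]]
splitParts b = go . drop 1 . dropWhile (not . isDelim)
  where
    isDelim l = dropWhileEnd isSpace l == "--" ++ b
    isClose l = dropWhileEnd isSpace l == "--" ++ b ++ "--"
    go [] = []
    go xs =
      let (part, rest) = break (\l -> isDelim l || isClose l) xs
      in part : case rest of
           (d : more) | isDelim d -> go more
           _                      -> []

-- Parse the headers, then recurse into the body if the entity is multipart.
parseEntity :: [String] -> Entity
parseEntity ls =
  case (isMulti, boundaryOf ctype) of
    (True, Just b) -> Multi media (map parseEntity (splitParts b body))
    _              -> Leaf media (dropWhileEnd null body)  -- trailing blank lines dropped
  where
    (rawHdrs, rest) = break null ls
    body    = drop 1 rest
    ctype   = fromMaybe "text/plain" (headerValue "Content-Type" (unfold rawHdrs))
    media   = map toLower (trim (takeWhile (/= ';') ctype))
    isMulti = "multipart/" `isPrefixOf` media

-- Depth-first search for the first text/plain leaf.
firstPlain :: Entity -> Maybe String
firstPlain (Leaf "text/plain" body) = Just (unlines body)
firstPlain (Leaf _ _)               = Nothing
firstPlain (Multi _ parts)          = listToMaybe (mapMaybe firstPlain parts)
```

Loaded into GHCi with the raw message in a string raw, firstPlain (parseEntity (messageLines raw)) should return the body of the first text/plain part at any nesting depth, and headerValue "From" (unfold (takeWhile (not . null) (messageLines raw))) should give the sender. The same shape carries over to Parsec: a parser for one entity reads the headers and, when the content type is multipart, runs itself on each section between the delimiter lines.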