I am learning a Haskell parsing library called Parsec, and as an exercise I need to parse an email message. I've been studying the specs, comparing messages from different clients, reading some RFCs, etc.
For this exercise all I need is to extract the "From:" header and the actual plain text body. Most clients seem to produce sane messages, or at least ones that don't deviate from the specs. The only exception is Outlook (I am not surprised, for some reason).
So the standard way, according to my reading, is to have a boundary sequence, say:
Content-Type: multipart/alternative; boundary=047d7b2e4e3cdc627304eb094bfe
and then all the parts of the multipart body are delimited by this boundary sequence, right? Please, correct me if I am wrong. I'd like my parser to work with all possible clients.
So the common pattern is
--boundary
headers
part
--boundary
headers
part
...
Now, looking at an Outlook-generated message, I see a different picture. It uses some kind of sub-boundaries, and I don't understand whether that is standard or not. This is Outlook's variant:
Content-Type: multipart/related;
type="multipart/alternative";
boundary="----_=_NextPart_001_01CEE199.851D3871"
Then the body is delimited like this
------_=_NextPart_001_01CEE199.851D3871
Content-Type: multipart/alternative;
boundary="----_=_NextPart_002_01CEE199.851D3871"
----_=_NextPart_002_01CEE199.851D3871
headers
body part
----_=_NextPart_002_01CEE199.851D3871
headers
body part
------_=_NextPart_001_01CEE199.851D3871
So it has an outer boundary with sequence 001 and then an inner boundary with sequence 002. So what is this? Is this some kind of Microsoft-specific MIME extension, or is it in an RFC that I missed? This is more complex to parse.
It's not really a sub-boundary; rather, a multipart section can itself contain multipart content. This nesting is standard MIME (RFC 2046), not a Microsoft extension.
This means that you'll have to parse the boundaries recursively: if a part's content type is multipart/alternative (or any other multipart/* type), it will carry its own boundary string and its own parts. The fact that this string is very similar to the outer boundary is just Outlook's doing. It could have been completely different.
Both
--part
--part
--part
and
--part
    --part
    --part
--part
--part
are valid structures.
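Since you are using Parsec, here is a minimal sketch of the recursive approach. It is only an illustration under some assumptions: line endings are already normalised to "\n", the boundary parameter name is written in lowercase, and all the names (Entity, parseEntity, parts, chunk, delimiter) are made up for this example. Each part is first cut out as raw text using the enclosing boundary and then parsed again as an entity of its own, which is where the recursion happens.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (dropWhileEnd, isPrefixOf)
import Text.Parsec
import Text.Parsec.String (Parser)

type Header = (String, String)

-- A MIME entity is either a leaf with a body or a multipart with child entities.
data Entity
  = Leaf [Header] String
  | Multipart [Header] [Entity]
  deriving (Show, Eq)

trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- One header field. Folded continuation lines (starting with space or tab)
-- are joined into the value. The field name is lowercased for lookup.
header :: Parser Header
header = do
  name  <- many1 (noneOf ":\n")
  _     <- char ':'
  first <- restOfLine
  more  <- many (try (many1 (oneOf " \t") *> restOfLine))
  return (map toLower name, unwords (map trim (first : more)))
  where
    restOfLine = manyTill anyChar ((() <$ newline) <|> eof)

-- Headers, the blank separator line, then everything else as the raw body.
headersAndBody :: Parser ([Header], String)
headersAndBody = do
  hs   <- many (try header)
  _    <- optional newline
  body <- many anyChar
  return (hs, body)

-- The boundary parameter of a multipart/* Content-Type, if there is one.
boundaryOf :: [Header] -> Maybe String
boundaryOf hs = do
  ct <- lookup "content-type" hs
  if "multipart/" `isPrefixOf` map toLower ct
    then either (const Nothing) Just (parse boundaryParam "content-type" ct)
    else Nothing

boundaryParam :: Parser String
boundaryParam = do
  _ <- manyTill anyChar (try (string "boundary="))
  quoted <|> many1 (noneOf "; \t\"")
  where
    quoted = between (char '"') (char '"') (many (noneOf "\""))

-- A delimiter line: "--" ++ boundary, plus a trailing "--" on the closing one.
-- Returns True for the closing delimiter.
delimiter :: String -> Parser Bool
delimiter b = do
  _      <- string ("--" ++ b)
  closer <- option False (True <$ try (string "--"))
  _      <- many (oneOf " \t")
  (() <$ newline) <|> eof
  return closer

-- Raw text up to the next delimiter line. End of input counts as closing,
-- so a message with a missing closing delimiter is still accepted.
chunk :: String -> Parser (String, Bool)
chunk b = do
  txt    <- manyTill anyChar (lookAhead (try end))
  closed <- end
  return (txt, closed)
  where
    end = (newline *> delimiter b) <|> (True <$ eof)

-- All raw parts of a multipart body. The preamble is dropped, the epilogue ignored.
parts :: String -> Parser [String]
parts b = do
  (_, closed) <- chunk b
  if closed then return [] else go
  where
    go = do
      (raw, closed) <- chunk b
      rest <- if closed then return [] else go
      return (raw : rest)

-- Parse an entity. Multipart entities recurse into each of their raw parts.
parseEntity :: String -> Either ParseError Entity
parseEntity src = do
  (hs, body) <- parse headersAndBody "entity" src
  case boundaryOf hs of
    Nothing -> return (Leaf hs body)
    Just b  -> do
      -- Prepend a newline so the first delimiter looks like all the others.
      raws <- parse (parts b) "multipart body" ('\n' : body)
      Multipart hs <$> mapM parseEntity raws
```

Re-parsing the raw text of each part keeps every parser small, at the cost of scanning nested content more than once. A single-pass parser that threads the current boundary through would also work, but is harder to get right while learning.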
It might be more obvious if Outlook made it look like this:
Content-Type: multipart/alternative;
boundary="firstmessage"
--firstmessage
content-type: multipart/alternative;
boundary="nestedpart"
--nestedpart
content-type: text/plain
nested body one
--nestedpart
content-type: text/plain
nested body two
--nestedpart--
--firstmessage
headers
second part of first message
--firstmessage--
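Once a message like this has been turned into a tree, pulling out the plain text body is just a depth-first walk over it. The sketch below assumes a hypothetical Entity type (a leaf with headers and a body, or a multipart with children) and assumes header names have been lowercased. Neither comes from a library.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (isPrefixOf)

type Header = (String, String)

-- Hypothetical tree type for a parsed message.
data Entity
  = Leaf [Header] String
  | Multipart [Header] [Entity]
  deriving (Show, Eq)

-- True when the part is text/plain. A part with no Content-Type header
-- is treated as text/plain, which is the default given in RFC 2045.
isPlainText :: [Header] -> Bool
isPlainText hs = case lookup "content-type" hs of
  Nothing -> True
  Just ct -> "text/plain" `isPrefixOf` map toLower (dropWhile isSpace ct)

-- Collect the bodies of all text/plain leaves, in document order,
-- however deeply the multipart sections are nested.
plainTextBodies :: Entity -> [String]
plainTextBodies (Leaf hs body)         = [body | isPlainText hs]
plainTextBodies (Multipart _ children) = concatMap plainTextBodies children
```

For a typical mail with one text/plain and one text/html alternative, this returns a single-element list no matter how many multipart layers the client wrapped around it.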