Tags: xml, email, encoding, utf-8, mime

Confused about Content-Transfer-Encoding when emailing an XML file as an attachment


I have a UTF-8 encoded XML file which is emailed as an attachment. When the recipient opens the email and saves the attachment, the XML file is no longer UTF-8 (it is reported as ANSI-encoded instead). In this instance, the recipient used Microsoft Outlook, if it matters.

I am programming this in an environment where I cannot rely on the availability of suitable MIME libraries, so I need to understand where I am going wrong.

After creating the XML file on the server and before emailing it, I can see with the Linux file command that it's a UTF-8 file. Separately, the XML also has the declaration <?xml version="1.0" encoding="UTF-8"?> (which isn't really relevant to my problem, but I'm including it for completeness). I'm pretty sure that the code which emails the file is the problem here, but I'm uncertain as to the "right" way to do this.

The headers I'm sending are:

"Mime-Version" "1.0"
"Content-Type" "multipart/mixed; boundary="__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___"\n\n"

The body of the email is:

--__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___\n
Content-Type: text/plain; charset="utf-8"; format=flowed\n
Content-Transfer-Encoding: 7bit\n\n
Please find attached the data file generated 
--__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___\n
Content-Type: text/plain; charset="utf-8"\n
Content-Disposition: attachment; filename="My_File_Name"\n\n
XML FILE CONTENTS GO HERE
--__==NAHDHDH2.28ABSDJxjhkjhsdkjhd___--\n

Questions:

  • Should I be using quoted-printable, 8bit, or some other Content-Transfer-Encoding here? I have tried all of them, but it hasn't changed the result.
  • Is Content-Type: text/plain correct for an XML attachment?
  • Any other suggestions?

Solution

  • By specifying text/plain you basically surrender control to the remote client's text-handling abilities, which are apparently limited in this particular case. An XML document declares its own encoding (and is UTF-8 by default), so a content type that identifies the payload as XML makes it more likely that the client leaves the bytes alone. Try text/xml or application/xml instead, or even the completely opaque application/octet-stream, which should only allow the recipient to save the file to disk in byte-for-byte identical form.

    The content transfer encoding should not affect this behavior at all, but since you seem to be unclear on its significance, here is a brief discussion.

    The content-transfer-encoding is completely transparent; it will not affect what is delivered or what the remote client can do with it. Which content transfer encoding to choose depends on the nature of your data and the capabilities of the email system it has to travel through. If the transport is not 8-bit clean, you need to encapsulate the data in a 7-bit CTE. If the content has lines which are too long for SMTP, it needs to be encapsulated into something with shorter lines. Either way, the remote client will extract whatever is inside the encapsulation at the other end. Use whatever circumstances dictate.
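
As a rough illustration of "use whatever circumstances dictate", the decision can be sketched in Python. The function name `choose_cte` and the size heuristic are assumptions made for this example, not something taken from the MIME standards. The sketch never returns 8bit or binary, because the sending code usually cannot know whether every hop supports them.

```python
def choose_cte(data: bytes, max_line: int = 998) -> str:
    """Pick a conservative Content-Transfer-Encoding for a payload.

    Heuristic only: quoted-printable costs about 2 extra characters per
    non-ASCII byte, base64 costs about 1/3 of the whole payload, so
    quoted-printable wins while non_ascii * 6 <= len(data).
    """
    longest = max((len(line) for line in data.splitlines()), default=0)
    non_ascii = sum(1 for byte in data if byte >= 0x80)

    if b"\x00" in data:
        # NUL bytes suggest binary content; treat it as opaque.
        return "base64"
    if non_ascii == 0 and longest <= max_line:
        # Pure ASCII with short lines can travel unencoded.
        return "7bit"
    if non_ascii * 6 <= len(data):
        # Mostly ASCII: quoted-printable stays readable and compact.
        return "quoted-printable"
    return "base64"
```

Whatever value this returns only describes the encoding to use; the payload still has to be encoded accordingly before it is sent.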

    There is a hierarchy of content transfer encodings for different circumstances:

    • 7bit is appropriate if your data is completely 7-bit ASCII and has no lines longer than 998 characters (not counting the terminating CRLF). Then it can survive even a crude old SMTP transfer without modification. In the absence of any explicit Content-Transfer-Encoding: header, this is the default according to the standards (although you frequently see stuff with 8-bit data in it without an explicit CTE, or even with an explicit 7bit declaration).

    • 8bit relaxes the requirement for the data to be 7-bit clean. If all systems which transport this message support the ESMTP 8BITMIME extension, this should be fine for data with restricted line lengths.

    • binary additionally allows for unlimited line length. In theory, you should be able to use this to pass through unrestricted content, but in practice, this seems to trigger glitches when systems don't strictly adhere to specifications. A typical symptom is that overlong lines are truncated or folded in transit, violating the integrity of the payload. To avoid problems like that (and to better adhere to the letter and the spirit of the standards for interoperability) you're better off with one of the following.

    • base64 accepts unrestricted content, but encodes it in a format which meets strict requirements for restricted line length and a severely constrained 7-bit character repertoire. It expands the payload to a bit more than 4/3 of the original size. Example:

        ugqcA7R5cPq667vNaSifRUH9HsW00NqZ1gwICk0pNrUkXFpNIFOpbf3o
        5ml8cqqSygkp8KBgPbHrqnDXvZTEBOkNo7ThE+BAvexa75Tm0Ebo/Yjl
        y697pMp1+dnSlk3YTqxkPI9vqpple13dXLHlvnFDmSi0gqIMSwo7kUFD
        SivAWhyCBR6tFO3lY1Pk6lz78+zgL28VthI72kVRkrWWtzoFef/4u5Ip
        GR00CtsNNEJo01GAQGpkTNFT9U9Q/UI9CMGgaI9E9RkMaTDTQICBEyaE
        woSCQOrNGA==
    
    • quoted-printable similarly accepts arbitrary content, but encodes each byte outside the safe ASCII range (and the = sign itself) as three characters of the form =XX, tripling the size of those bytes. When most of the input is ASCII, this is a tolerable amount of overhead. In other words, this is suitable for roughly textual formats with occasional non-ASCII content, such as text in many Western languages using an 8-bit encoding, or formats like HTML where the ASCII markup dominates over the actual content, in pretty much any language. Example:
        <?xml version=3D"1.0" encoding=3D"UTF-8"?>h=C3=ABll=C3=B6 =
        w=C3=B6rld
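
Since the question rules out MIME libraries, the base64 item in the list above can be sketched by hand. This is a minimal Python illustration, assuming the 76-character line limit for encoded lines; the names `ALPHABET` and `base64_encode` are made up for the example.

```python
# The 64-character alphabet used by MIME base64.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def base64_encode(data: bytes, width: int = 76) -> str:
    """Encode data as base64 text in CRLF-terminated lines of at most `width` characters."""
    quads = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # 0, 1 or 2 bytes missing from the final group
        # Pack three bytes into a 24-bit number, then cut it into four 6-bit values.
        n = int.from_bytes(chunk + b"\x00" * pad, "big")
        quad = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            # Characters that only cover padding bytes are replaced by "=".
            quad[-pad:] = "=" * pad
        quads.append("".join(quad))
    encoded = "".join(quads)
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    return "".join(line + "\r\n" for line in lines)
```

Every three input bytes become four output characters, which is where the 4/3 expansion comes from; the line breaks add a little on top.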
    

    Quoted printable is not hard to implement at all, and would seem suitable for your scenario.
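
To back up the claim that quoted-printable is easy to implement, here is a minimal hand-written encoder in Python. It is a sketch under assumptions: the input is text, so CRLF, bare LF and bare CR are all treated as line breaks, and the function name `quoted_printable_encode` is hypothetical.

```python
def quoted_printable_encode(data: bytes, width: int = 76) -> str:
    """Encode text bytes as quoted-printable with CRLF line endings."""
    out_lines = []
    # Normalize all line-break styles, then encode line by line.
    for raw in data.replace(b"\r\n", b"\n").replace(b"\r", b"\n").split(b"\n"):
        tokens = []
        for i, byte in enumerate(raw):
            is_last = i == len(raw) - 1
            if byte == 0x3D or byte > 126 or (byte < 32 and byte != 9):
                # "=", non-ASCII bytes and control characters are always escaped.
                tokens.append("=%02X" % byte)
            elif byte in (9, 32) and is_last:
                # Trailing tab or space must be escaped so it survives transport.
                tokens.append("=%02X" % byte)
            else:
                tokens.append(chr(byte))
        line = ""
        for token in tokens:
            # Leave room for the "=" that marks a soft line break.
            if len(line) + len(token) > width - 1:
                out_lines.append(line + "=")
                line = ""
            line += token
        out_lines.append(line)
    return "\r\n".join(out_lines)
```

Tokens are never split, so an escape such as =C3 always stays on one line, and a trailing "=" marks a soft break that the decoder removes again.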

    All of this is codified in the MIME RFCs 2045 through 2049. Wikipedia has nice readable articles about e.g. base64 and quoted-printable.

    It's not clear from your description whether you just declared your content to be quoted-printable, or actually encoded it. I've seen people do the former and act surprised when it didn't work, but hope you did the latter. Just a cautionary tale.
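
To make the "declare and actually encode" point concrete, here is a hedged Python sketch of the message body from the question with the attachment both labelled and encoded as base64, and typed as application/xml. The function name `build_message_body`, the file name `data.xml` and the boundary string are hypothetical. The sketch uses Python's standard `base64` module for brevity; in an environment without one, a hand-written encoder would take its place. It also assumes the cover text is plain ASCII, so that 7bit is an honest label for it.

```python
import base64


def build_message_body(text: str, xml_bytes: bytes, filename: str, boundary: str) -> str:
    """Build a multipart/mixed body: an ASCII text part plus a base64 XML attachment."""
    # Encode the attachment for real, then wrap it to 76-character lines.
    encoded = base64.b64encode(xml_bytes).decode("ascii")
    encoded_lines = [encoded[i:i + 76] for i in range(0, len(encoded), 76)]

    lines = [
        "--" + boundary,
        'Content-Type: text/plain; charset="utf-8"',
        "Content-Transfer-Encoding: 7bit",
        "",
        text,  # assumed to be pure ASCII
        "--" + boundary,
        "Content-Type: application/xml",
        "Content-Transfer-Encoding: base64",  # matches what was done above
        'Content-Disposition: attachment; filename="%s"' % filename,
        "",
        *encoded_lines,
        "--" + boundary + "--",
        "",
    ]
    # Every line ends in CRLF, and each boundary starts on a line of its own.
    return "\r\n".join(lines)


body = build_message_body(
    "Please find attached the data file",
    '<?xml version="1.0" encoding="UTF-8"?><a>h\u00ebll\u00f6</a>'.encode("utf-8"),
    "data.xml",
    "BOUNDARY42",
)
```

The top-level headers (MIME-Version and the multipart/mixed Content-Type with the same boundary) are still sent separately, as in the question.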