Tags: encoding, utf-8, base64, email-headers, mime-mail

MIME email Subject etc. headers vs. utf8: first split, then encode?


Let's take this Subject line,

$ echo -n 台電用戶意見電子信箱-信件受 | base64
5Y+w6Zu755So5oi25oSP6KaL6Zu75a2Q5L+h566xLeS/oeS7tuWPlw==

Once encoded (and prefixed with "Subject:" etc.), it exceeds the length limits of RFC 2047: at most 75 characters per encoded-word and 76 per header line. So some mailers (a certain power company's) first encode it, then split it:

Subject: =?utf-8?B?5Y+w6Zu755So5oi25oSP6KaL6Zu75a2Q5L+h566xLeS/oeS7?=
 =?utf-8?B?tuWPl+eQhumAmuefpQ==?=

(But that can easily "fracture" a UTF-8 multibyte character. Here the cut falls inside the three bytes of 件.)
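To make the fracture concrete, here is a small Python check (a sketch; the helper name `decodes_alone` is my own) that Base64-decodes the two payloads from the header above separately. Neither half is valid UTF-8 on its own, while the concatenated bytes decode cleanly.

```python
import base64

# The two Base64 payloads copied from the encode-then-split header above.
first = base64.b64decode("5Y+w6Zu755So5oi25oSP6KaL6Zu75a2Q5L+h566xLeS/oeS7")
second = base64.b64decode("tuWPl+eQhumAmuefpQ==")


def decodes_alone(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 without any neighbouring bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 件 (U+4EF6) is e4 bb b6 in UTF-8; the cut leaves e4 bb in the first
# word and b6 in the second.
print(first[-2:].hex(), second[:1].hex())
print(decodes_alone(first), decodes_alone(second))

# Only the joined byte string is valid UTF-8.
print((first + second).decode("utf-8"))
```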

Other mailers (e.g., Gnus) first split it, then encode it:

Subject: =?utf-8?B?5Y+w6Zu755So5oi25oSP6KaL6Zu75a2Q5L+h566xLeS/oeS7tg==?=
 =?utf-8?B?5Y+X55CG6YCa55+l?=

The latter renders correctly in every mail reader I know of today, because each encoded-word decodes on its own.

My question is, who is at fault when some mail readers (e.g., the Gmail Android app) choke on the former?

Should mail readers always first paste the two strings together, then decode? (So the Gmail app is wrong.)

Or is it also OK to first decode each one, then paste the two decoded strings together? (So the mailer software is wrong.)
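The two reader strategies can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function names are hypothetical, the regular expression handles nothing but `utf-8` with the `B` encoding, and it says nothing about how Gmail or any other reader is actually implemented.

```python
import base64
import re

# Simplification: only utf-8 / B encoded-words are recognised here.
ENCODED_WORD = re.compile(r"=\?utf-8\?B\?([^?]*)\?=", re.IGNORECASE)


def payloads(header_value: str) -> list:
    """Base64-decode the payload of every encoded-word, in order."""
    return [base64.b64decode(p) for p in ENCODED_WORD.findall(header_value)]


def decode_then_join(header_value: str) -> str:
    """Strategy A: treat each encoded-word as a complete string."""
    return "".join(raw.decode("utf-8", errors="replace")
                   for raw in payloads(header_value))


def join_then_decode(header_value: str) -> str:
    """Strategy B: paste the bytes together first, then decode once."""
    return b"".join(payloads(header_value)).decode("utf-8", errors="replace")


encode_then_split = (
    "=?utf-8?B?5Y+w6Zu755So5oi25oSP6KaL6Zu75a2Q5L+h566xLeS/oeS7?=\n"
    " =?utf-8?B?tuWPl+eQhumAmuefpQ==?="
)
split_then_encode = (
    "=?utf-8?B?5Y+w6Zu755So5oi25oSP6KaL6Zu75a2Q5L+h566xLeS/oeS7tg==?=\n"
    " =?utf-8?B?5Y+X55CG6YCa55+l?="
)

for name, value in [("encode-then-split", encode_then_split),
                    ("split-then-encode", split_then_encode)]:
    print(name, "| A:", decode_then_join(value), "| B:", join_then_decode(value))
```

Strategy A shows U+FFFD replacement characters in place of 件 for the encode-then-split header, while both strategies agree on the split-then-encode header. So only the second form is safe regardless of which strategy a reader picks.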

(I assume the same issue occurs for Quoted-Printable too, not only Base64.)
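A quick way to check that assumption, sketched in Python with the standard `quopri` module (the helper names are my own): in the Q encoding every non-ASCII byte becomes an `=XX` triplet, so a cut between two triplets can still land inside one multibyte character.

```python
import quopri

q_payload = b"=E4=BB=B6"  # 件 in Q encoding: three triplets, one character


def q_decode(payload: bytes) -> bytes:
    """Decode a Q-encoded payload (header=True also maps '_' to space)."""
    return quopri.decodestring(payload, header=True)


def is_char_boundary(raw: bytes, i: int) -> bool:
    """A UTF-8 cut at byte offset i is safe unless raw[i] is a
    continuation byte (bit pattern 10xxxxxx)."""
    return i == len(raw) or (raw[i] & 0xC0) != 0x80


raw = q_decode(q_payload)
print(raw.decode("utf-8"))                                 # the whole character
print([is_char_boundary(raw, i) for i in range(len(raw) + 1)])

# Cutting after the second triplet leaves two fragments that are not
# valid UTF-8 by themselves.
left, right = q_decode(q_payload[:6]), q_decode(q_payload[6:])
print(left.hex(), right.hex())
```

The only safe cut points are offsets 0 and 3, i.e. before or after the whole character, exactly as with Base64.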

Broken UTF-8

Indeed, if you think about it, saying =?utf-8?B?...?= means that the ... part should be a valid UTF-8 string on its own, right? So the mailer software is wrong!

Likewise, there has probably never been a syntax defined for splitting one =?utf-8?B?...?= into two, because the splitting should have been taken care of beforehand: creating the =?utf-8?B?...?= string should always be the final step.

So: Mailer software: GUILTY. Gmail: NOT GUILTY.

2021/09/01: here I analyze the first line of the Subject


Solution

  • As per RFC 2047 § 8's examples (and the overall explanation), an encoded-word does not magically span several instances:

    • =?UTF-8?Q?a?= neither continues a previous encoded-word, nor can it be continued by a following encoded-word. It is what it is: a.
    • It is more obvious when we mix text encodings: =?UTF-8?Q?a?= =?ISO-8859-1?Q?b?= should render as ab, and it is clear that cutting UTF-8 in between would only halfway work when the next encoded-word is UTF-8 again (while a different text encoding surely uses different bytes).

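    As a logical consequence, UTF-8 should be split by characters, not bytes. RFC 2047 § 5 also says so explicitly: each encoded-word must represent an integral number of characters, and a multi-octet character may not be split across adjacent encoded-words. This means that neither encoding B (Base64) nor Q (Quoted-Printable) output should be cut (unless the cut coincidentally also falls between the encoded text's characters). The cutting must occur before encoding.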

    I can only guess this is "too complicated" for a few programmers and they just think "it won't break anything anyway, so far nobody complained". But if an encoded-word must be cut, the proper way is to first decode it so that the text can be cut character-wise (instead of byte-wise), and then to encode both parts again. One caveat: whoever does so must also support that text encoding. While UTF-8 is widespread today, would the software also know where to cut Shift-JIS, Big5 and UTF-16BE?
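The split-first, encode-last order described above can be sketched in Python as follows. This is a minimal illustration, not a complete RFC 2047 implementation: `encode_subject` and `decode_words` are hypothetical helpers, the text is cut per code point (not per grapheme cluster), only the B encoding is produced, and unencoded text between encoded-words is ignored by the decoder.

```python
import base64
import quopri
import re

ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")


def _word(chunk: str, charset: str) -> str:
    """Encode one chunk of text as a single, self-contained encoded-word."""
    payload = base64.b64encode(chunk.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def encode_subject(text: str, charset: str = "utf-8", name: str = "Subject") -> str:
    """Cut the text character-wise first, then encode every piece."""
    words, chunk = [], ""
    budget = 76 - len(name) - 2      # space left on the line after "Subject: "
    for ch in text:
        if chunk and len(_word(chunk + ch, charset)) > budget:
            words.append(_word(chunk, charset))
            chunk = ch
            budget = 75              # continuation line: 1 space + 75 characters
        else:
            chunk += ch
    if chunk:
        words.append(_word(chunk, charset))
    return f"{name}: " + "\r\n ".join(words)


def decode_words(value: str) -> str:
    """Decode each encoded-word on its own, with its own charset, then join."""
    out = []
    for charset, encoding, payload in ENCODED_WORD.findall(value):
        if encoding in "Bb":
            raw = base64.b64decode(payload)
        else:
            raw = quopri.decodestring(payload.encode("ascii"), header=True)
        out.append(raw.decode(charset))
    return "".join(out)


subject = "台電用戶意見電子信箱-信件受理通知"
header = encode_subject(subject)
print(header)
print(decode_words(header) == subject)
print(decode_words("=?UTF-8?Q?a?= =?ISO-8859-1?Q?b?="))
```

For the example subject this produces the same two encoded-words as the split-then-encode header in the question. Because the cut is made on decoded text and each chunk is encoded separately, the codec takes care of where the bytes of a character begin and end, which is the part a byte-wise cutter would otherwise have to know for every text encoding it supports.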