email, email-validation, ionos, rfc5322

Email with special characters rejected - RFC-6532 and "quoted-printable"


One email provider rejected an email containing special characters (e.g. umlauts). They say that they are RFC-5321 and RFC-5322 compliant. I browsed those standards, but they do not support international emails (and thus no umlauts); only 7-bit ASCII is supported. There is an extension, RFC-6532, which standardizes international emails. Our emails are UTF-8 (quoted-printable) encoded and sent like this:

"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<[email protected]>

Is this an RFC-6532 compliant address? Or is it some other/older RFC (like RFC-2054)? After all there are so many mail related RFCs that I might have missed 10 or 20 ;-)


Solution

  • It's on the right track, but it's wrong.

    "=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<[email protected]>
    

    There are 2 problems with the above form:

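    1. The encoded-word (the =?UTF-8?Q?...?= bit) is quoted and shouldn't be. RFC2047 does not allow an encoded-word inside a quoted-string, so standards-compliant mail software that parses this address won't decode that token.
    2. The "name" is butted up against the angle brackets and should not be. RFC2047 requires an encoded-word in a display name to be separated from adjacent tokens (including the "<") by whitespace, so there MUST be a space in order to be standards compliant.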
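As a rough illustration, the two problems listed above can be detected mechanically. The sketch below is a regex heuristic, not a full RFC5322 parser; the function name `display_name_problems` and the address `user@example.com` are placeholders invented for this example, since the real address is not shown in the question.

```python
import re

# Rough pattern for an RFC 2047 encoded-word: =?charset?encoding?text?=
ENCODED_WORD = r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?="


def display_name_problems(value: str) -> list[str]:
    """Heuristically flag the two issues described above (hypothetical helper)."""
    problems = []
    # Problem 1: an encoded-word wrapped in double quotes.
    if re.search(r'"[^"]*' + ENCODED_WORD + r'[^"]*"', value):
        problems.append("encoded-word inside a quoted-string")
    # Problem 2: no whitespace between the encoded-word and the "<".
    if re.search(ENCODED_WORD + r'"?<', value):
        problems.append("no whitespace before '<'")
    return problems


# user@example.com is a placeholder address, not the one from the question.
print(display_name_problems('"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<user@example.com>'))
print(display_name_problems('=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <user@example.com>'))
```

The first call reports both problems; the second, which uses an unquoted encoded-word followed by a space, reports none.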

    In other words, this is what it should look like:

    =?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <[email protected]>
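If the mail is generated from code, it is usually safer to let a library build the header than to assemble it by hand. The following is a minimal sketch using Python's standard `email` package; the helper names `encode_from` and `decode_display_name` and the address `user@example.com` are placeholders made up for this example. Note that the library may choose Base64 ("B") instead of quoted-printable ("Q") for the encoded-word; both are valid under RFC2047.

```python
from email.header import decode_header, make_header
from email.utils import formataddr, parseaddr

ADDRESS = "user@example.com"  # placeholder address (assumption)


def encode_from(name: str, address: str) -> str:
    """Build 'encoded-word <address>' with an unquoted, RFC 2047 encoded name."""
    return formataddr((name, address), charset="utf-8")


def decode_display_name(header_value: str) -> tuple[str, str]:
    """Split a header value into (decoded display name, address)."""
    raw_name, address = parseaddr(header_value)
    name = str(make_header(decode_header(raw_name)))
    return name, address


header = encode_from("Börge Möller", ADDRESS)
print(header)
print(decode_display_name(header))
# The hand-written "Q" form from the answer decodes to the same name.
print(decode_display_name("=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <user@example.com>"))
```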
    

    The RFCs that you need to look at are:

    • RFC5322 - this defines the modern Message syntax that is implemented by the server you are trying to interoperate with.
    • RFC2047 - this defines the methods and syntax of the encoded-words that are needed to represent non-ASCII characters in headers like Subject and address headers (e.g. To/From/Cc/Reply-To/etc). (This is the =?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= part)
    • RFC822 - this defines the grammar used by RFC2047 and is an older version of RFC5322.

    It may also be helpful to read RFC2822, which is newer than RFC822 but older than RFC5322. My guess, however, is that you can skip it because it won't add a lot of value. The only reason RFC822 still has value is its much older grammar definitions that are referenced by RFC2047 (such as atom, word, phrase, addr-spec, specials, linear-white-space, etc).

    RFC6532 is even newer than RFC5322. Its purpose is to remove the need to encode headers altogether by allowing UTF-8 to be used directly in header values.
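To make the difference concrete, here is a small sketch that serializes the same From header twice with Python's standard `email` package: once under `policy.SMTP` (ASCII-only headers, so the name becomes an RFC2047 encoded-word) and once under `policy.SMTPUTF8` (raw UTF-8 headers in the style of RFC6532). The helper name `render_from` and the address are placeholders. Raw UTF-8 headers are only deliverable if the receiving side supports the SMTPUTF8 extension (RFC6531), which a provider that claims only RFC-5321/5322 compliance presumably does not.

```python
from email import policy
from email.headerregistry import Address
from email.message import EmailMessage


def render_from(active_policy) -> bytes:
    """Serialize a message containing only a From header under the given policy."""
    msg = EmailMessage(policy=active_policy)
    # "user" and "example.com" are placeholder values (assumption).
    msg["From"] = Address("Börge Möller", "user", "example.com")
    return msg.as_bytes()


encoded = render_from(policy.SMTP)       # RFC 2047 encoded-word, pure ASCII
raw_utf8 = render_from(policy.SMTPUTF8)  # raw UTF-8 in the header

print(encoded)
print(raw_utf8)
```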

    Before RFC6532, no standard allowed any character encoding other than ASCII (which was what RFC822 used) to appear directly in headers, so everything was always supposed to conform to ASCII.

    A lot of software doesn't follow the standards, however, and so there was a lot of mail in the real world that used ISO-8859-1 and every other character encoding under the sun, all depending on what region the user(s) were in and what character encoding(s) were in wide use in those regions (e.g. Big5 and GB2312 are popular in various parts of China, Shift-JIS is popular in Japan, EUC-KR/KS-C-5601-1987 are popular in Korea, etc).

    This caused major interoperability problems, though, not least because not every mail client could handle every character encoding under the sun, but also because there was no way for a client to figure out which character encoding was even being used! It's all just binary gobbledygook.

    UTF-8, however, has existed for a long time and it can represent all characters in all languages, so it was only logical for it to eventually win out as the standard character encoding to use for international email.