
Plain text email not able to display non-ASCII characters?


I recently had an email exchange with an organisation's customer service representative about a non-ASCII character in an email from that organisation. This character is correctly stored in the organisation's online system, but when retrieved from there and sent to me by email, the character seems to get converted to a different representation.

Viewing the customer service representative's email in Thunderbird 78.7.1 on Linux, the character that should be 'ü' shows up as a white question mark on a black diamond background: '�'

The character also shows up thus in Gmail in my web browser (Firefox 88.0.1 on Linux), so it doesn't look like it is related to Thunderbird being my email client program.

When I asked whether the system in which this character is being processed limits characters to the ASCII character set, I was given the reply:

"This confirmation was sent in plain text which does not display special characters."

I am pretty sure that even plain text email can display non-ASCII characters, and that what matters is the character encoding, where Unicode would allow all manner of characters to be displayed correctly, as long as the user agent displaying the text has a corresponding font installed.

As to the funny character �: when viewing the message source, the text contains the following character sequence at the corresponding position: =EF=BF=BD

What kind of encoding is that, and how is it possible for 'ü' to get translated to "=EF=BF=BD"?


Edited: Here is an anonymized excerpt of the email source with the parts related to encoding and the mentioned funny character on the last line:

...
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Mailer: ColdFusion 2016 Application Server

Your contact information on file is as follows:                            =
  =20
G=EF=BF=BDld...

Solution

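  • The sequence =EF=BF=BD is the quoted-printable encoding of the three UTF-8 bytes EF BF BD, which together represent U+FFFD, the Unicode replacement character. So the 'ü' was already lost before the message was encoded for transport: it's either incorrectly stored in their database, or incorrectly converted by the software which sent the email.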

    Indeed, a MIME part with Content-Type: text/plain can optionally have a charset parameter to select a different character set than the legacy pre-MIME default of 7-bit US-ASCII. This facility has been part of email for much longer than either Thunderbird or Gmail has existed, and by your description, their own message was clearly sent as UTF-8 with quoted-printable encoding.
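To see what the sequence from the question actually contains, here is a minimal sketch using only Python's standard library. It undoes the quoted-printable layer, decodes the result as UTF-8 (as the message's own headers declare), and asks Unicode for the name of the resulting character. The variable names are just for illustration.

```python
import quopri
import unicodedata

# Bytes exactly as they appear in the message source quoted in the question.
raw = b"G=EF=BF=BDld"

# Step 1: undo the quoted-printable transfer encoding (=XX -> one byte each).
decoded_bytes = quopri.decodestring(raw)

# Step 2: decode the bytes using the charset declared in Content-Type.
text = decoded_bytes.decode("utf-8")

print(decoded_bytes)               # the three bytes EF BF BD sit between G and ld
print(unicodedata.name(text[1]))   # REPLACEMENT CHARACTER

# For comparison, this is what a correctly transmitted 'ü' looks like on the wire.
expected_qp = quopri.encodestring("ü".encode("utf-8"))
print(expected_qp)                 # b'=C3=BC'
```

In other words, the message is well-formed UTF-8 in quoted-printable; it just faithfully transports a character that was already a placeholder.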

    You can easily demonstrate to them that their email client can display a mixture of Chinese, Arabic, Cyrillic, Hebrew, and Indic characters simply by sending them one.

    Subject: demo text/plain message
    Mime-Version: 1.0
    Content-type: text/plain; charset="utf-8"
    Content-transfer-encoding: quoted-printable
    
    Here's =C3=BC, and here's =EF=BF=BD
    

    should display as

    Here's ü, and here's �

    You can add other characters from the full Unicode repertoire by looking up their code point and encoding it as quoted-printable; it should not be hard to find web sites which easily let you type in arbitrary text and have it thusly encoded.
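If you would rather not encode the characters by hand, a short script can build an equivalent message. The following is a sketch using Python's standard `email` package; the subject and body mirror the demo message above, and no recipient is set because delivery is left to whatever tool you use.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "demo text/plain message"

# Ask for UTF-8 text transported as quoted-printable, like the demo message.
msg.set_content(
    "Here's ü, and here's \ufffd\n",
    charset="utf-8",
    cte="quoted-printable",
)

# The serialized message is pure ASCII; non-ASCII characters appear as =XX.
wire = msg.as_string()
print(wire)
```

The printed output can be saved to a file and used as the message text. Header capitalization and quoting may differ slightly from the hand-written version, which does not matter to a mail client.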

    On a Unix-like system with outbound email properly configured, you can send this message by storing it in a text file, then

    sendmail -oi recipient@example.com <filename
    

    Unicode specifically assigns U+FFFD as the character to substitute when the proper character cannot be decoded or represented for whatever reason. We can speculate that their email system or the bridge from their database was implemented by a junior developer with a limited understanding of email or Unicode, or even both. A proper implementation would store the text in the database as UTF-8 and simply extract the information verbatim; but some legacy database platforms require non-ASCII strings to be stored in some proprietary or legacy format.
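One plausible way for 'ü' to end up as =EF=BF=BD can be reproduced in a few lines. This is only a guess at the mechanism, not a diagnosis of their system: it assumes the text was held in a legacy single-byte encoding (Latin-1 is used here as an example) and then decoded as if it were UTF-8. The name "Güld" is a hypothetical stand-in for the truncated value in the question.

```python
import quopri

# Hypothetical stored value; the real name is elided in the question.
original = "Güld"

# Assumption: the backend holds the text in a legacy single-byte encoding.
legacy_bytes = original.encode("latin-1")      # 'ü' is the single byte 0xFC

# The mistake: treating those bytes as UTF-8. 0xFC is not valid UTF-8,
# so a lenient decoder substitutes U+FFFD for it.
mangled = legacy_bytes.decode("utf-8", errors="replace")

# The mailer then correctly encodes the already-damaged text.
on_the_wire = quopri.encodestring(mangled.encode("utf-8"))
print(on_the_wire)                             # b'G=EF=BF=BDld'
```

Note that the substitution is lossy: once U+FFFD has been written, nothing downstream can tell which character it replaced, so the fix has to happen on the sending side.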

    The central pertinent IETF standards governing this are RFC 5322 for the basic format of email messages, and RFCs 2045 through 2049, which describe MIME (the full set is not necessary; 2045 is the centerpiece, 2046 describes different content types, and 2047 provides a special notation specifically for email header values). Wikipedia has an article which describes and discusses the Unicode replacement character.
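As a small illustration of the RFC 2047 header notation mentioned above, this sketch encodes a non-ASCII value for use in a header such as Subject, then decodes it again, using Python's standard `email.header` module. The value "Güld" is again a hypothetical example.

```python
from email.header import Header, decode_header

# Encode a non-ASCII value as an RFC 2047 "encoded-word" (pure ASCII).
encoded = Header("Güld", charset="utf-8").encode()
print(encoded)   # has the form =?charset?encoding?payload?=

# Decode it again: decode_header yields (bytes, charset) pairs.
raw_bytes, charset = decode_header(encoded)[0]
restored = raw_bytes.decode(charset)
print(restored)
```

This notation applies only to headers; message bodies use the Content-Type charset and Content-Transfer-Encoding mechanism shown in the question's excerpt.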