
Chinese characters displayed as question marks in mailing list?


I subscribed to a mailing list whose content is primarily Chinese. Every time I receive an email from it, every Chinese character is replaced by ?. I dug into the email headers and found:

> Content-Type: text/plain; charset="utf-8"

I think this is the problem, and that to solve it I need to change the charset to one that is compatible with Chinese. But where do I change this?

I don't think I can change it on my side, as the problem appears regardless of which mail client I use. The mailing list runs on Mailman, which is written in Python.


Solution

  • Content-Type: text/plain; charset="utf-8"

    This is not the issue.

    UTF-8 can encode every character in every language. What is very probably happening is that the system in question is storing the data in a database which is not true UTF-8, such as MySQL's legacy utf8 character set and its utf8_ collations. That character set is an alias for utf8mb3 and stores at most three bytes per character; utf8mb4 is the full UTF-8 implementation.
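As a rough illustration of the MySQL point: a character set limited to three bytes per character cannot hold code points above U+FFFF (outside the Basic Multilingual Plane). A minimal Python sketch that finds such characters in a string follows; the helper name `chars_outside_bmp` is made up for this example.

```python
def chars_outside_bmp(text):
    """Return the characters in `text` whose code point is above U+FFFF.

    These are the characters that need 4 bytes in UTF-8.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


# U+4E2D and U+6587 are in the BMP; U+20000 is a rare ideograph
# in CJK Unified Ideographs Extension B (supplementary plane).
sample = "\u4e2d\u6587 \U00020000"
outside = chars_outside_bmp(sample)
print([f"U+{ord(ch):04X}" for ch in outside])
```

Most everyday Chinese text sits entirely within the BMP, so if every character is being replaced, a non-Unicode column or connection character set is a more likely culprit than the three-byte limit alone.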

    If it is not a database storage issue, then the issue comes from the character set used when the email is generated or when the data is inserted into the email template. Somewhere along the line, the source of the email is not UTF-8 or a corresponding full Chinese character set.

    For example:

    • HTML input form not set to UTF-8 (or the correct Chinese character set).
    • Web page containing the HTML input form not set to UTF-8.
    • Code receiving the HTML form not set to UTF-8 (or the correct Chinese character set).
    • Email template generator not set to UTF-8 (or the correct Chinese character set).
    • Sending server otherwise not using default UTF-8 headers.
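To see how a mismatch at any of these steps can produce literal question marks, here is a minimal Python sketch. It assumes, purely for illustration, that one step encodes the text as Latin-1 with lossy error handling; the real failing step on the list server may differ.

```python
text = "\u4e2d\u6587\u90ae\u4ef6"  # four Chinese characters ("Chinese email")

# A step that encodes to a charset lacking these characters, with
# errors="replace", substitutes one "?" byte per character.
lossy = text.encode("latin-1", errors="replace")
print(lossy)

# The result is plain ASCII, so it is also valid UTF-8. Labelling the
# message charset="utf-8" afterwards cannot bring the characters back.
recovered = lossy.decode("utf-8")
print(recovered)

# Encoding as UTF-8 at every step preserves the text.
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

This is why the header can correctly say charset="utf-8" while the body still shows question marks in every mail client: the characters were already destroyed before the message was sent.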

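    Also, while you state "content is primarily Chinese", this doesn't narrow things down much, as there are at least five main Chinese written languages and a host of smaller languages using the same or very similar character sets. This matters if a legacy encoding is involved, since Simplified Chinese text has typically used GB2312, GBK or GB18030 and Traditional Chinese text has typically used Big5.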

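    You want to have the email constructed using the UTF-8 encoding throughout (the encoding that Internationalized Resource Identifiers are also based on). UTF-8 implements Unicode, and in Unicode each character has a code point. The most common Chinese characters fall in the CJK Unified Ideographs block, U+4E00 to U+9FFF; rarer ones live in extension blocks, some of which lie above U+FFFF.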
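For illustration, this is roughly what constructing a UTF-8 message looks like with Python's standard `email` package. It is a sketch only: the addresses and text are placeholders, and Mailman builds its messages internally, so this is not its actual code path.

```python
import email
from email import policy
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "\u6d4b\u8bd5"           # non-ASCII subject ("test")
msg["From"] = "list@example.org"          # placeholder address
msg["To"] = "subscriber@example.org"      # placeholder address
body = "\u4f60\u597d\uff0c\u4e16\u754c"   # "Hello, world" in Chinese
msg.set_content(body, charset="utf-8")

# Serialise to bytes as it would go over the wire, then parse it back.
raw = msg.as_bytes()
parsed = email.message_from_bytes(raw, policy=policy.default)

print(parsed.get_content_charset())
print(parsed["Subject"])
print(parsed.get_content().strip())
```

If the text is real Unicode when it reaches this step, the body and the subject survive the round trip unchanged.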

    But UTF-8 doesn't encode characters by just storing their code point (UTF-32 does that). Instead, it uses a variable-length scheme in which every character in the U+4E00 to U+9FFF range takes 3 bytes, and ideographs in the supplementary planes take 4 bytes.
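The difference between a code point and its encoded bytes can be checked directly in Python. A small sketch, where `describe` is a helper name invented for this example:

```python
def describe(ch):
    """Return (code point, UTF-8 bytes as hex, UTF-32 big-endian bytes as hex)."""
    return (
        f"U+{ord(ch):04X}",
        ch.encode("utf-8").hex(),
        ch.encode("utf-32-be").hex(),
    )


# An ASCII letter, a common Chinese character, and a rare ideograph
# from CJK Unified Ideographs Extension B.
for ch in ("A", "\u4e2d", "\U00020000"):
    code_point, utf8_hex, utf32_hex = describe(ch)
    print(code_point, "UTF-8:", utf8_hex, "UTF-32:", utf32_hex)
```

UTF-32 stores the code point itself, padded to four bytes, whereas UTF-8 uses one byte for ASCII, three for U+4E2D and four for U+20000.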

    For reference: Python Mailman and UTF-8 details (2008 question) and a character conversion guide (2009). Also this Stack Overflow answer.