character-encoding mime email-client webmail

Which character encoding does an email client use to encode Japanese characters?


I'm analyzing the character sets used in MIME to combine text from multiple character sets.

For that, I wrote a sample email:

This is sample test email 精巣日本 dsdsadsadsads

which automatically gets converted into:

This is sample test email &#31934;&#24035;&#26085;&#26412; dsdsadsadsads

I want to know which character set encoding is used to encode these characters. Is it possible to use that encoding from C?

Email client: Postfix webmail


Solution

  • The purpose of MIME is to allow for support of arbitrary content types and encodings. As long as the content is adequately tagged in the MIME headers, you can use any encoding you see fit. There is no single right encoding for your use case; though in this day and age, the simplest solution by far is to use Unicode for everything.

    In MIME terms, you'd use something like Content-Type: text/plain; charset="utf-8" and then encode the body text correspondingly. If you need the email to be 7-bit safe, you might use a quoted-printable or base64 content-transfer encoding on top, but any modern MIME library should take care of this detail for you.
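
    To make the header and transfer-encoding combination concrete, here is a minimal C sketch that builds a tiny MIME entity by hand. The function names base64_encode and format_mime_text are made up for this illustration, and the sketch assumes a body of at most 57 bytes so that the base64 output fits on one 76-character line; a real mailer (or a MIME library) has to wrap longer bodies and add the remaining message headers.

    ```c
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Base64-encode `len` bytes from `in` into `out` (NUL-terminated).
       Returns the encoded length, or -1 if `outcap` is too small.
       Hypothetical helper for this sketch, not a library function. */
    static int base64_encode(const unsigned char *in, size_t len,
                             char *out, size_t outcap)
    {
        static const char tbl[] =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        size_t need = 4 * ((len + 2) / 3);
        size_t i, o = 0;

        if (outcap < need + 1)
            return -1;
        /* Full 3-byte groups become 4 output characters. */
        for (i = 0; i + 2 < len; i += 3) {
            out[o++] = tbl[in[i] >> 2];
            out[o++] = tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
            out[o++] = tbl[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
            out[o++] = tbl[in[i + 2] & 0x3f];
        }
        /* One or two leftover bytes are padded with '='. */
        if (i < len) {
            out[o++] = tbl[in[i] >> 2];
            if (i + 1 < len) {
                out[o++] = tbl[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
                out[o++] = tbl[(in[i + 1] & 0x0f) << 2];
            } else {
                out[o++] = tbl[(in[i] & 0x03) << 4];
                out[o++] = '=';
            }
            out[o++] = '=';
        }
        out[o] = '\0';
        return (int)o;
    }

    /* Write MIME headers, a blank line, and the base64 body for a short
       UTF-8 string. Assumption: at most 57 input bytes, so the body is a
       single base64 line. Returns the message length, or -1 on error. */
    static int format_mime_text(const char *utf8, char *msg, size_t msgcap)
    {
        char b64[80];
        size_t len = strlen(utf8);
        int n;

        if (len > 57 ||
            base64_encode((const unsigned char *)utf8, len, b64, sizeof b64) < 0)
            return -1;
        n = snprintf(msg, msgcap,
                     "MIME-Version: 1.0\r\n"
                     "Content-Type: text/plain; charset=\"utf-8\"\r\n"
                     "Content-Transfer-Encoding: base64\r\n"
                     "\r\n"
                     "%s\r\n", b64);
        return (n < 0 || (size_t)n >= msgcap) ? -1 : n;
    }
    ```

    Calling format_mime_text with the UTF-8 bytes of the message text fills the buffer with the three headers followed by the encoded body. The recipient's client reverses the base64 step and then interprets the bytes as UTF-8 because the charset parameter says so.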

    The HTML entities you observed in your experiment are not suitable for plain-text emails, though they are a viable alternative for pure-HTML email. (If your webmail client used them in plaintext emails, it is buggy; it will only work if the sender and recipient both have the same bug.)
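
    Numeric HTML entities of this kind are just Unicode code point numbers written in decimal (&#26085; is U+65E5) or hexadecimal (&#x65E5;), so they can be turned back into real text. The following is a lenient C sketch, not a full HTML parser; utf8_encode and ncr_to_utf8 are names invented for this example, and named entities such as &amp; are deliberately left untouched.

    ```c
    #include <ctype.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* Encode one Unicode code point as UTF-8. Returns the number of bytes
       written (1 to 4), or 0 for surrogates and values above U+10FFFF. */
    static size_t utf8_encode(unsigned long cp, char out[4])
    {
        if (cp < 0x80) {
            out[0] = (char)cp;
            return 1;
        }
        if (cp < 0x800) {
            out[0] = (char)(0xC0 | (cp >> 6));
            out[1] = (char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {
            if (cp >= 0xD800 && cp <= 0xDFFF)
                return 0;
            out[0] = (char)(0xE0 | (cp >> 12));
            out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (char)(0x80 | (cp & 0x3F));
            return 3;
        }
        if (cp <= 0x10FFFF) {
            out[0] = (char)(0xF0 | (cp >> 18));
            out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;
    }

    /* Copy `in` to `out`, replacing numeric character references with
       their UTF-8 bytes. Anything that does not parse as a reference is
       copied unchanged. Returns 0 on success, -1 if `out` is too small. */
    static int ncr_to_utf8(const char *in, char *out, size_t outcap)
    {
        size_t o = 0;

        if (outcap == 0)
            return -1;
        while (*in) {
            char buf[4];
            size_t n = 0;

            if (in[0] == '&' && in[1] == '#') {
                const char *p = in + 2;
                int base = 10;
                char *end = NULL;

                if (*p == 'x' || *p == 'X') {
                    base = 16;
                    p++;
                }
                /* Require a digit first so strtoul cannot skip spaces or signs. */
                if (base == 16 ? isxdigit((unsigned char)*p)
                               : isdigit((unsigned char)*p)) {
                    unsigned long cp = strtoul(p, &end, base);
                    if (*end == ';' && cp != 0)
                        n = utf8_encode(cp, buf);
                }
                if (n > 0)
                    in = end + 1;   /* skip past the closing ';' */
            }
            if (n == 0) {           /* ordinary byte: copy as-is */
                buf[0] = *in++;
                n = 1;
            }
            if (o + n >= outcap)
                return -1;
            memcpy(out + o, buf, n);
            o += n;
        }
        out[o] = '\0';
        return 0;
    }
    ```

    Passing a plain-text body that contains such references through ncr_to_utf8 yields ordinary UTF-8 text, which can then be labeled with charset="utf-8" instead of relying on the recipient to interpret the entities.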

    Traditionally, Japanese email messages would use one of the legacy Japanese encodings, like Shift_JIS or ISO-2022-JP. These have reasonable support for English, but generalize poorly to properly multilingual text (though ISO-2022 does support it after a fashion, by switching character sets with escape sequences). With Unicode, by contrast, mixing Japanese with e.g. Farsi, Uzbek, and Turkish is straightforward and undramatic.

    Using UTF-8 from C is easy and basically transparent. See e.g. http://utf8everywhere.org/ for some starting points.
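
    As a small illustration of what "transparent" means here: a UTF-8 string is an ordinary NUL-terminated char array, so strlen, strcpy, and strstr keep working, but they operate on bytes. Counting characters needs one extra step, sketched below. The name utf8_count is hypothetical, and the function assumes its input is already valid UTF-8.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Count code points in a NUL-terminated UTF-8 string by skipping
       continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8;
       it does not validate the input. */
    static size_t utf8_count(const char *s)
    {
        size_t n = 0;

        for (; *s; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80)
                n++;
        }
        return n;
    }
    ```

    For a string made of "test " followed by the two Japanese characters 日本, strlen reports 11 bytes while utf8_count reports 7 code points, because each of the two characters occupies three bytes in UTF-8.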