Tags: character-encoding, iconv

How to convert a string from a given encoding to some other encoding?


I am not going to write any software that converts text between different character encodings; iconv exists. I just got curious about how it can be done while reading this excellent tutorial on character encodings. Since a character has different code points in different encodings, it seems to me there is no automatic way to do it, that is, without human intervention. Hopefully, OCR is not the way to do it :P.

To clarify, let's say we have the string "Yahoo!" and its encoding in UTF-8 is E. What encoding E' in Latin-1 is also "Yahoo!"? The example I have chosen seems simple, though. How is this done for a general string?


Solution

  • As the excellent author of that excellent tutorial writes (*cough*cough*):

    Converting between encodings is the tedious task of comparing two code pages and deciding that character 152 in encoding A is the same as character 4122 in encoding B, then changing the bits accordingly.

    It really is pages and pages of mind-numbing tables to map one encoding to another. For example, from the iconv source code:

    static const unsigned short cp950_2uni_pagea1[314] = {
      /* 0xa1 */
      0x3000, 0xff0c, 0x3001, 0x3002, 0xff0e, 0x2027, 0xff1b, 0xff1a,
      0xff1f, 0xff01, 0xfe30, 0x2026, 0x2025, 0xfe50, 0xfe51, 0xfe52,
      0x00b7, 0xfe54, 0xfe55, 0xfe56, 0xfe57, 0xff5c, 0x2013, 0xfe31,
      0x2014, 0xfe33, 0x2574, 0xfe34, 0xfe4f, 0xff08, 0xff09, 0xfe35,
      0xfe36, 0xff5b, 0xff5d, 0xfe37, 0xfe38, 0x3014, 0x3015, 0xfe39,
      0xfe3a, 0x3010, 0x3011, 0xfe3b, 0xfe3c, 0x300a, 0x300b, 0xfe3d,
      0xfe3e, 0x3008, 0x3009, 0xfe3f, 0xfe40, 0x300c, 0x300d, 0xfe41,
      0xfe42, 0x300e, 0x300f, 0xfe43, 0xfe44, 0xfe59, 0xfe5a, 0xfe5b,
      0xfe5c, 0xfe5d, 0xfe5e, 0x2018, 0x2019, 0x201c, 0x201d, 0x301d,
      0x301e, 0x2035, 0x2032, 0xff03, 0xff06, 0xff0a, 0x203b, 0x00a7,
      0x3003, 0x25cb, 0x25cf, 0x25b3, 0x25b2, 0x25ce, 0x2606, 0x2605,
      0x25c7, 0x25c6, 0x25a1, 0x25a0, 0x25bd, 0x25bc, 0x32a3, 0x2105,
      0x00af, 0xffe3, 0xff3f, 0x02cd, 0xfe49, 0xfe4a, 0xfe4d, 0xfe4e,
      0xfe4b, 0xfe4c, 0xfe5f, 0xfe60, 0xfe61, 0xff0b, 0xff0d, 0x00d7,
      0x00f7, 0x00b1, 0x221a, 0xff1c, 0xff1e, 0xff1d, 0x2266, 0x2267,
      0x2260, 0x221e, 0x2252, 0x2261, 0xfe62, 0xfe63, 0xfe64, 0xfe65,
      0xfe66, 0xff5e, 0x2229, 0x222a, 0x22a5, 0x2220, 0x221f, 0x22bf,
      0x33d2, 0x33d1, 0x222b, 0x222e, 0x2235, 0x2234, 0x2640, 0x2642,
      0x2295, 0x2299, 0x2191, 0x2193, 0x2190, 0x2192, 0x2196, 0x2197,
      0x2199, 0x2198, 0x2225, 0x2223, 0xff0f,
      ...
    

    Typically it makes sense to map each encoding to Unicode. Since Unicode aims to cover the characters of every other encoding, you can go encoding A → Unicode → encoding B without needing a conversion table for every possible pair of encodings (which would take on the order of n² tables and be rather ridiculous). There's no general procedural shortcut. Some encodings map to Unicode more easily than others, so you can get away with pretty short code for those, but all the others simply need hand tuning.
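To make the A → Unicode → B route concrete, here is a minimal sketch in C for one of the easy cases: Latin-1 and UTF-8. It is not how iconv does it internally; the names `latin1_to_utf8`, `utf8_to_latin1`, and `CONV_ERROR` are made up for this illustration. Latin-1 needs no table because each byte value equals its Unicode code point, and UTF-8 is computed from the code point by bit shuffling. It also answers the "Yahoo!" question: for pure ASCII text the bytes come out identical in both encodings.

```c
#include <stddef.h>
#include <stdint.h>

#define CONV_ERROR ((size_t)-1)  /* hypothetical error value for this sketch */

/* Latin-1 -> Unicode: the byte value is the code point, so no table is needed. */
static uint32_t latin1_decode(unsigned char byte) {
    return byte;
}

/* Unicode -> UTF-8: returns the number of bytes written (1..4), or 0 if
 * the code point is a surrogate or out of range. */
static size_t utf8_encode(uint32_t cp, unsigned char *out) {
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Latin-1 -> Unicode -> UTF-8. Returns bytes written, or CONV_ERROR if
 * the output buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_cap) {
    size_t o = 0;
    for (size_t i = 0; i < in_len; i++) {
        unsigned char buf[4];
        size_t n = utf8_encode(latin1_decode(in[i]), buf);
        if (o + n > out_cap) return CONV_ERROR;
        for (size_t k = 0; k < n; k++) out[o++] = buf[k];
    }
    return o;
}

/* UTF-8 -> Unicode -> Latin-1. Only 1- and 2-byte sequences are decoded,
 * because longer ones encode code points >= U+0800, which Latin-1 cannot
 * represent. Returns bytes written, or CONV_ERROR on malformed input, an
 * unrepresentable character, or a full output buffer. */
size_t utf8_to_latin1(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_cap) {
    size_t i = 0, o = 0;
    while (i < in_len) {
        uint32_t cp;
        if (in[i] < 0x80) {
            cp = in[i];
            i += 1;
        } else if ((in[i] & 0xE0) == 0xC0 && i + 1 < in_len &&
                   (in[i + 1] & 0xC0) == 0x80) {
            cp = ((uint32_t)(in[i] & 0x1F) << 6) | (uint32_t)(in[i + 1] & 0x3F);
            i += 2;
            if (cp < 0x80) return CONV_ERROR;  /* overlong form */
        } else {
            return CONV_ERROR;
        }
        if (cp > 0xFF) return CONV_ERROR;      /* not in Latin-1 */
        if (o >= out_cap) return CONV_ERROR;
        out[o++] = (unsigned char)cp;
    }
    return o;
}
```

Note that the reverse direction can fail: a character such as € (U+20AC) exists in Unicode but has no Latin-1 byte, so a real converter has to report an error, substitute something, or transliterate.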

    Note that at least some of that source code was likely generated from existing sources; I doubt all those tables are hand-written. Nonetheless, somebody had to write some lookup table at some point.
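For an encoding that does need a table, the lookup itself is short. The sketch below uses Windows-1252, which differs from Latin-1 only in the bytes 0x80 to 0x9F. The table name `cp1252_2uni_80` and the function `cp1252_to_unicode` are made up for this example (styled after the iconv excerpt above, not copied from it), and returning U+FFFD for the five unassigned bytes is an assumption of this sketch; real converters differ on how they treat them.

```c
#include <stdint.h>

/* Windows-1252 bytes 0x80..0x9F mapped to Unicode code points.
 * 0xfffd (REPLACEMENT CHARACTER) marks bytes the code page leaves
 * unassigned; that choice is an assumption of this sketch. */
static const uint16_t cp1252_2uni_80[32] = {
    /* 0x80 */
    0x20ac, 0xfffd, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
    0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0xfffd, 0x017d, 0xfffd,
    /* 0x90 */
    0xfffd, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
    0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0xfffd, 0x017e, 0x0178,
};

/* Decode one Windows-1252 byte to a Unicode code point. Outside
 * 0x80..0x9F the byte value already equals the code point. */
uint32_t cp1252_to_unicode(unsigned char byte) {
    if (byte >= 0x80 && byte <= 0x9F)
        return cp1252_2uni_80[byte - 0x80];
    return byte;
}
```

For example, `cp1252_to_unicode(0x80)` yields 0x20AC (€) and `cp1252_to_unicode(0x99)` yields 0x2122 (™). Multi-byte encodings such as CP950 work on the same principle, just with far larger tables indexed by byte pairs.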