
How to convert a Big5 encoded txt file to UTF8 encoded txt file?


I have a Big5-encoded file that Mac TextEdit can't open. How can I convert the whole file to UTF-8, since UTF-8 is much more widely supported?

I tried using iconv in my terminal, but it fails with the error below, and I couldn't find anything useful about this error on Google either.

$ iconv -f BIG5 -t UTF8 in.txt > out.txt
iconv: in.txt:5:0: cannot convert

Are there any other ways to convert?

I got the txt file from here; it is a list of Chinese names written in Traditional Chinese as used in Taiwan.


Solution

  • Looking at the first 20 lines of your file, it is clear that the encoding uses the byte 0x8C as the first byte of some multibyte sequences. The encodings that have this property are:

    • BIG5
    • BIG5-HKSCS
    • CP932
    • CP936
    • CP949
    • CP950
    • GB18030
    • GBK
    • JOHAB
    • Shift_JIS
    • Shift_JISX0213

    Try them in turn:

    $ for encoding in BIG5 BIG5-HKSCS CP932 CP936 CP949 CP950 GB18030 GBK \
                      JOHAB Shift_JIS Shift_JISX0213; do
        # Print the encoding only if the first 20 lines convert without error.
        if head -n 20 unique_names_2012.txt | iconv -f "$encoding" -t UTF-8 > /dev/null 2>&1; then
          echo "$encoding"
        fi
      done
    

    With GNU libiconv, this prints:

    BIG5-HKSCS
    CP950
    GB18030
    

    Is it in GB18030 encoding?

    $ iconv -f GB18030 < unique_names_2012.txt
    

    shows hundreds of lines containing characters in the Unicode Private Use Area (PUA). While not impossible, that seems unlikely for a list of names.

    Is it in CP950 encoding?

    $ iconv -f CP950 < unique_names_2012.txt
    

    gives a conversion error at line 2294.

    Is it in BIG5-HKSCS encoding?

    $ iconv -f BIG5-HKSCS < unique_names_2012.txt
    

    gives a conversion error at line 713.
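
    To see which byte sequence iconv is rejecting, it can help to dump the raw bytes of the reported line and look them up in a conversion table. A minimal sketch, assuming a POSIX shell with sed and od available; show_line_bytes is a hypothetical helper name, not part of iconv, and it assumes the line numbers iconv reports start at 1, as sed's do:

```shell
# Hypothetical helper: print the bytes of line $2 of file $1 in hex.
# LC_ALL=C makes sed treat the input as raw bytes instead of text
# in the current locale, so undecodable sequences pass through untouched.
show_line_bytes() {
  LC_ALL=C sed -n "${2}p" "$1" | od -An -tx1
}
```

    For example, `show_line_bytes unique_names_2012.txt 713` would print the bytes of the line at which the BIG5-HKSCS conversion stopped.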

    So, most probably the file is encoded in a variant of BIG5. There are many such variants; see http://haible.de/bruno/charsets/conversion-tables/Chinese.html. Possibly it's an extension of CP950 or an extension of BIG5-HKSCS (since these are the most popular encodings from the BIG5 family today).

    In summary, such conversion errors are caused by the unstandardized proliferation of BIG5 variants.

    The best thing you can do is to request the original file in UTF-8 encoding; let the originator deal with it.
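
    If the originator cannot provide a UTF-8 copy, one lossy fallback (not part of the answer above) is iconv's -c option, which discards characters that cannot be converted instead of aborting. A rough sketch; convert_lossy is a hypothetical wrapper, the file names are assumed, and CP950 is only a guess at the closest base encoding:

```shell
# Hypothetical wrapper: convert file $2 from encoding $1 to UTF-8,
# writing the result to $3. With -c, iconv drops unconvertible
# characters rather than stopping at the first one, so the output
# may be missing characters.
convert_lossy() {
  iconv -c -f "$1" -t UTF-8 "$2" > "$3"
}

# Example invocation (assumed file names):
# convert_lossy CP950 unique_names_2012.txt names_utf8.txt
```

    Any name containing a dropped character comes out incomplete, so the result should be checked against the source. Some iconv implementations may still exit with a non-zero status even when -c is given.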