bash csv ubuntu utf-8 thai

How do I manipulate CSVs containing Unicode (Thai) characters using bash?


I've got an AdWords dump containing Thai keywords, which I'll join with data from another DB.

In theory, I grab the file, snip off the useless lines at the top and bottom, clean it up a little and upload it to PostgreSQL as a new table.

In practice, the characters are garbled from the very start, even though the file opens fine in Excel and OpenOffice. This happens on both my local machine (running OS X) and the server (running Ubuntu).

First, I already set my locale to UTF-8:

 $ echo "กระเป๋า สะพาย คอนเวิร์ส"
 กระเป๋า สะพาย คอนเวิร์ส

However, looking at the CSV (let's assume it only contains the above string) on the CLI gives me this:

$ head file.csv    
#0@2 *02" -@'4#L* 

Any idea where the problem is?


Solution

  • The original file is encoded as UTF-16, not UTF-8, so tools that expect UTF-8 (like your terminal) display it as garbage.

    $ file file.csv
    file.csv: Little-endian UTF-16 Unicode English text
    

    Quick fix (iconv writes to stdout, so redirect the result to a new file):

    $ iconv -f UTF-16 -t UTF-8 file.csv > file.utf8.csv
    $ head file.utf8.csv
    กระเป๋า สะพาย คอนเวิร์ส
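
Since the question also mentions snipping off the useless lines at the top and bottom, the conversion and the trimming could be combined in one step. The following is only a sketch: the function name `utf16_csv_to_utf8`, the output file name, and the line counts are made up for illustration, and it assumes the file starts with a byte-order mark (which the `file` output above suggests) so that `iconv -f UTF-16` can pick the right endianness.

```shell
# Hypothetical helper: convert a UTF-16 CSV to UTF-8 and drop
# the first $2 and the last $3 lines. The result goes to stdout.
utf16_csv_to_utf8() {
  iconv -f UTF-16 -t UTF-8 "$1" |
    awk -v skip_head="$2" -v skip_tail="$3" '
      NR > skip_head { buf[++n] = $0 }   # keep everything after the header lines
      END {
        # print the kept lines except the trailing skip_tail ones
        for (i = 1; i <= n - skip_tail; i++) print buf[i]
      }
    '
}

# Example call (file names and counts are placeholders):
# utf16_csv_to_utf8 file.csv 5 1 > file.utf8.csv
```

The awk step buffers the whole file in memory, which should be fine for a typical keyword export but may not suit very large dumps. It is used here instead of `head -n -N` because negative counts are a GNU extension and may not work with the `head` shipped on OS X.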