utf-8, character-encoding, iconv

Force encode from US-ASCII to UTF-8 (iconv)


I'm trying to transcode a bunch of files from US-ASCII to UTF-8.

For that, I'm using iconv:

iconv -f US-ASCII -t UTF-8 file.php > file-utf8.php

My original files are US-ASCII encoded, so the conversion doesn't seem to happen. Apparently this is because ASCII is a subset of UTF-8...

iconv US ASCII to UTF-8 or ISO-8859-15

And quoting:

There's no need for the textfile to appear otherwise until non-ASCII characters are introduced

True. If I introduce a non-ASCII character in the file and save it, let's say with Eclipse, the file encoding (charset) is switched to UTF-8.

In my case, I'd like to force iconv to transcode the files to UTF-8 anyway, whether or not they contain non-ASCII characters.

Note: The reason is that my PHP code (non-ASCII files...) deals with some non-ASCII strings, and those strings are not interpreted correctly (French):

Il était une fois... l'homme série animée mythique d'Albert

Barillé (Procidis), 1ère

...

  • US ASCII -- is -- a subset of UTF-8 (see Ned's answer below)
  • Meaning that US ASCII files are actually encoded in UTF-8
  • My problem came from somewhere else
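To make the subset relationship concrete, here is a minimal Python sketch (not part of the original post; the sample strings are made up for illustration). It encodes a pure-ASCII string both ways and compares the bytes, then shows that a non-ASCII character cannot be encoded as ASCII at all.

```python
# Pure-ASCII text produces exactly the same bytes under both encodings.
text = "echo 'hello';"          # hypothetical sample, ASCII only
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True

# A non-ASCII character needs more than one byte in UTF-8 ...
accented = "\u00e9"             # the letter e with an acute accent
print(len(accented.encode("utf-8")))  # 2

# ... and cannot be represented in ASCII at all.
try:
    accented.encode("ascii")
    encodable_as_ascii = True
except UnicodeEncodeError:
    encodable_as_ascii = False
print(encodable_as_ascii)  # False
```

Since the bytes are identical, transcoding an ASCII-only file to UTF-8 has nothing to change.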

Solution

  • ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.

    It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.

    For example, if they are using a latin1 encoding, then run:

    iconv -f latin1 -t UTF-8 filename.txt > filename-utf8.txt
    

    There are many tools that can help detect the character encoding, such as:

    $ uchardet filename.txt
    ISO-8859-1
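As a rough illustration of the same idea in code, here is a small Python sketch. The helper names (`is_pure_ascii`, `decodes_as`, `transcode_to_utf8`) and the sample bytes are assumptions made for this example, not something from the question. It checks whether data is really ASCII, tests a guessed encoding, and re-encodes latin1 bytes as UTF-8, which is roughly what the iconv command above does.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes decode without error in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Hypothetical sample: the word "etait" with an accented first letter,
# as it would be stored in a latin1 file.
latin1_bytes = "\u00e9tait".encode("latin-1")

print(is_pure_ascii(latin1_bytes))        # False: byte 0xE9 is outside ASCII
print(decodes_as(latin1_bytes, "utf-8"))  # False: not valid UTF-8 either
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")
print(utf8_bytes)                         # b'\xc3\xa9tait'
```

Keep in mind that latin1 accepts every possible byte sequence, so a successful latin1 decode does not prove the guess is right. That is why a statistical detector such as uchardet is the better way to identify the source encoding before transcoding.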