Tags: bash, utf-8, iconv

iconv - Transliterate if possible, otherwise leave unconverted


Consider the following example line of text:

α Arietis, called Hamal, is the brightest star in Aries. Its traditional name is derived from the Arabic word for “lamb” or “head of the ram” (ras al-hamal).

It contains three different non-ASCII characters: the α, a left smart quote, and a right smart quote.

My goal is to transliterate as much as possible from UTF-8 to regular ASCII, but leave any non-convertible characters as-is. (In the above sample text, the smart quotes can be transliterated to ", but the α cannot.)

My current command is:

iconv -f UTF-8 -t ASCII//TRANSLIT < iconv.sample

However, it fails to convert the α and terminates with iconv: (stdin):1:0: cannot convert.
If I add //IGNORE to the target or use the -c option, it drops the α altogether.

How can I transliterate where possible, but fall back to the original input character where not?


Solution

  • I'm not sure this is possible with iconv, because the output has to conform to the target encoding (that is, if you specify ASCII, it's only going to spit out ASCII, no matter what).

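    If you have uconv available (it ships with ICU), you can specify the transliteration separately from the output encoding, so the output stays UTF-8 and anything the transliterator does not handle passes through unchanged: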

    uconv -f "UTF-8" -t "UTF-8" -x "Latin-ASCII"
    

    As an example:

    $ echo "α Arietis “head of the ram”" | uconv -f "UTF-8" -t "UTF-8" -x "Latin-ASCII"
    α Arietis "head of the ram"
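
    If uconv is not installed, a rough fallback is to replace only the characters you know about and leave everything else alone. The following is a minimal sketch, not part of the original answer: translit_known is a hypothetical helper name, and the mapping covers only the four curly quote characters, so it is a hand-maintained substitute rather than a general transliterator.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not an iconv/uconv feature):
# map a fixed set of typographic quotes to ASCII and pass all other input through.
translit_known() {
    # One expression per character avoids bracket expressions, which can
    # misbehave on multibyte characters in a non-UTF-8 locale.
    sed -e 's/“/"/g' -e 's/”/"/g' \
        -e "s/‘/'/g" -e "s/’/'/g"
}

# The α has no entry in the mapping, so it is left as-is.
printf '%s\n' 'α Arietis “head of the ram”' | translit_known
# prints: α Arietis "head of the ram"
```

    To process the sample file from the question, the same function can be used as translit_known < iconv.sample. Any further characters have to be added to the sed expressions by hand.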