
bash: convert html entities to UTF-8, but keep existing UTF-8


As in a related question, I need to convert HTML entities (e.g. &amp;) to their UTF-8 characters (&) while leaving existing UTF-8 characters untouched. The difference is that in my case I need to do this from the bash command line.

I can use a tool like recode and run echo '&amp;' | recode html..utf-8, which converts it to & just fine. However, when the string already contains UTF-8 characters, as in

echo 'Arabic &amp; ٱلْعَرَبِيَّة' | recode html..utf-8

I get:

Arabic & Ù±ÙÙعÙرÙبÙÙÙÙØ©

which, naturally, is not what I need. It should look like this at the end:

Arabic & ٱلْعَرَبِيَّة

Is there a way to do this without a bunch of messy and seemingly endless regex? Thanks


Solution

  • perl one-liner:

    $ echo 'Arabic &amp; ٱلْعَرَبِيَّة' | perl -CS -MHTML::Entities -ne 'print decode_entities($_)'
    Arabic & ٱلْعَرَبِيَّة
    

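    The -CS switch makes perl treat standard input, output, and error as UTF-8, so existing characters and newly decoded entities are written out in the same encoding. This requires the HTML::Entities module, which is part of the larger HTML::Parser bundle. Install it through your OS package manager or favorite CPAN client.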
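
If installing HTML::Entities is not an option, a similar result can probably be had from Python 3's standard library. The sketch below is not part of the original answer: it assumes python3 is on the PATH, and the function name decode_html_entities is made up for illustration. It relies on html.unescape, which only touches entity references and leaves other characters alone.

```shell
# Hypothetical helper (not from the original answer): decode HTML entities
# read from stdin using Python 3's standard-library html.unescape.
decode_html_entities() {
    # PYTHONIOENCODING forces UTF-8 on stdin/stdout regardless of locale,
    # so text that is already UTF-8 is read and written back unchanged.
    PYTHONIOENCODING=utf-8 python3 -c '
import html, sys
sys.stdout.write(html.unescape(sys.stdin.read()))
'
}

echo 'Arabic &amp; ٱلْعَرَبِيَّة' | decode_html_entities
```

Unlike the perl -n loop, this reads all of stdin at once, which should be fine for short strings but may be worth changing to line-by-line processing for very large inputs.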