
How can I convert a bunch of files from ISO-8859-1 to UTF-8 using Perl?


I have several documents I need to convert from ISO-8859-1 to UTF-8 (without the BOM, of course). The problem is that there are so many of them, in a mix of encodings (some UTF-8, some ISO-8859-1), that I need an automated way of converting them. Unfortunately, I only have ActivePerl installed and don't know much about handling encodings in Perl. I may be able to install PHP, but I am not sure, as this is not my personal computer.

Just so you know, I use SciTE or Notepad++, but neither converts the files correctly. For example, if I open a document in Czech that contains the character "ž" and choose the "Convert to UTF-8" option in Notepad++, the character is converted to something unreadable.

There is a way I can convert them, but it is tedious. If I open the document with the special characters, copy its contents to the Windows clipboard, paste them into a UTF-8 document and save it, the result is fine. Opening every file and copying/pasting into a new document is too much work for the number of documents I have.

Any ideas? Thanks!!!


Solution

  • If the character 'ž' is included, then the encoding is definitely not ISO-8859-1 ("Latin 1"), which has no 'ž' at all, but is probably CP1252 ("Win Latin 1"). Dealing with a mix of UTF-8, ISO-8859-1 and CP1252 (possibly even in the same file) is exactly what the Encoding::FixLatin Perl module is designed for.
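
    To see why, here is a minimal sketch using Perl's core Encode module (my own quick check, not something Encoding::FixLatin requires). It decodes the single byte 0x9E under both encodings: in CP1252 that byte is 'ž', while in ISO-8859-1 it is an invisible C1 control character.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # The single byte 0x9E, as it would appear in a legacy 8-bit file.
    my $byte = "\x9E";

    # Decode the same byte under each candidate encoding.
    my $as_cp1252 = decode('cp1252',     $byte);
    my $as_latin1 = decode('iso-8859-1', $byte);

    # Print the resulting Unicode code points.
    printf "cp1252:     U+%04X\n", ord $as_cp1252;   # U+017E, small z with caron
    printf "iso-8859-1: U+%04X\n", ord $as_latin1;   # U+009E, a C1 control character
    ```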

    You can install the module from CPAN by running this command:

    perl -MCPAN -e "install 'Encoding::FixLatin'"
    

    You could then write a short Perl script that uses the Encoding::FixLatin module, but there's an even easier way. The module comes with a command called fix_latin, which reads mixed-encoding text on standard input and writes UTF-8 to standard output. So you could use a command line like this to convert one file:

    fix_latin <input-file.txt >output-file.txt
    

    If you're running Windows, the fix_latin command might not be in your path and might not have been run through pl2bat (which wraps a Perl script in a batch file), in which case you'd need to do something like:

    perl C:\perl\bin\fix_latin.pl <input-file.txt >output-file.txt
    

    The exact paths and filenames would need to be adjusted for your system.

    Running fix_latin across a whole bunch of files would be trivial on a Linux system, but on Windows you'd probably need to use PowerShell or similar.
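
    Since Perl is already installed, another option is to do the looping in Perl itself. The following is only a sketch, assuming Encoding::FixLatin is installed: the sub name convert_file and the ".utf8" output suffix are my own placeholders, and the script writes a new file next to each original rather than overwriting it.

    ```perl
    use strict;
    use warnings;
    use File::Glob qw(bsd_glob);
    use Encoding::FixLatin qw(fix_latin);

    # Convert one file of mixed UTF-8 / Latin-1 / CP1252 bytes to UTF-8.
    # convert_file is a hypothetical helper name, not part of the module.
    sub convert_file {
        my ($in_file, $out_file) = @_;

        # Read the raw bytes so Perl does not guess at an encoding.
        open my $in, '<:raw', $in_file or die "Cannot read $in_file: $!";
        my $bytes = do { local $/; <$in> };
        close $in;

        # bytes_only => 1 asks fix_latin for UTF-8 bytes rather than a
        # Perl character string, so the result can be written out as-is.
        my $utf8 = fix_latin($bytes, bytes_only => 1);

        open my $out, '>:raw', $out_file or die "Cannot write $out_file: $!";
        print {$out} $utf8;
        close $out or die "Cannot close $out_file: $!";
    }

    # cmd.exe does not expand wildcards such as *.txt, so expand them here.
    my @files = map { bsd_glob($_) } @ARGV;
    warn "No input files given\n" unless @files;

    # Assumed naming scheme: original name plus ".utf8".
    convert_file($_, "$_.utf8") for @files;
    ```

    Saved as, say, fix_all.pl (a made-up name), it could be run as "perl fix_all.pl *.txt". Files that are already valid UTF-8 should come out unchanged, which is what makes this safe on a mixed batch.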