perl, character-encoding, ipc, open3

IPC::Open3 converting character encoding


I am observing strange behaviour with the arguments passed to IPC::Open3 in a script.

I pass a string containing ISO-8859-15 characters. Just before open3() is called (literally the statement before), the string is correct (verified with print and Data::Dumper).

However, once the subprocess is started, the arguments are UTF-8 encoded. I have verified this using both the target executable (freebcp) and a wrapper script. As a workaround, I ended up writing a wrapper script that converts all the arguments back to ISO-8859-15.

What causes this behaviour? LANG is set to en_AU.ISO-8859-15. It works correctly on other hosts. I cannot find any reference to binmode().


Solution

  • I have a string containing ISO-8859-15. Just before open3() is called (literally the statement before) the string is correct (verified with print and Data::Dumper).

    However once the subprocess is started the arguments are now UTF-8 encoded.

    LANG is set to en_AU.ISO-8859-15.

    By default, Perl 5 does not do any encoding conversion: strings are treated as plain byte arrays.

    That holds until you tell Perl that a string contains Unicode characters, for example by calling decode(), or by reading the string from a file handle that has an encoding layer attached (via binmode(), via open() flags, via use open with :encoding or :locale, or via the -C command-line switch).

    Since your string starts out as ISO-8859-15 but arrives as UTF-8, Perl must be aware of the string's encoding. Somewhere, somehow, you have told Perl the encoding of the string, and Perl has converted it to Unicode, which it represents internally as UTF-8. That internal UTF-8 is what now seems to reach the process started by open3().

    As a possible solution, try explicitly converting the strings into the desired encoding (for example with Encode::encode()) before you pass them on.

    P.S. The utf8::is_utf8() function can help you debug when and how your strings get converted to Unicode, and whether Perl really holds them as Unicode.
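
The two suggestions above (check the flag, then encode explicitly) can be sketched roughly as follows. This is only an illustration under assumptions, not the asker's actual code: the sample string, the flag_state() helper, and the stand-in child process (the running perl printing its first argument as hex instead of freebcp) are all hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use IPC::Open3 qw(open3);
use Symbol qw(gensym);

# Hypothetical helper: report whether Perl holds $str as characters
# (UTF8 flag on) or as plain bytes (flag off).
sub flag_state {
    my ($str) = @_;
    return utf8::is_utf8($str) ? 'on' : 'off';
}

# Assumed input: ISO-8859-15 bytes for "caf", e-acute (0xE9) and the euro sign (0xA4).
my $raw   = "caf\xE9\xA4";
my $chars = decode('ISO-8859-15', $raw);    # what a decoding layer would produce
my $bytes = encode('ISO-8859-15', $chars);  # explicit conversion back to bytes

printf "raw: %s, decoded: %s, re-encoded: %s\n",
    flag_state($raw), flag_state($chars), flag_state($bytes);

# Stand-in child process: the running perl prints its first argument as hex.
my @cmd = ($^X, '-e', 'print unpack("H*", $ARGV[0])', $bytes);

my $err = gensym;                            # open3 needs a real glob for stderr
my $pid = open3(my $in, my $out, $err, @cmd);
close $in;
my $hex = do { local $/; <$out> };           # slurp everything the child printed
waitpid $pid, 0;

print "child received: $hex\n";
```

If the re-encoded argument is passed, the child should report 636166e9a4, that is, the original single-byte ISO-8859-15 values. Printing flag_state() at several points in the real script is one way to narrow down where the string first gets decoded.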