Tags: windows, perl, utf-8

What is the reason for this bizarre issue parsing a UTF-8 command line argument on Windows?


I am trying to pass in a string that contains the Unicode character "right single quotation mark" (decimal 8217, hex \x{2019}).

Perl is not receiving the character correctly. Let me show you the details:

Perl Script follows (we'll call it test.pl):

use warnings;
use strict;
use v5.32;
use utf8; # Some UTF-8 chars are present in the code's comments

# Get the first argument
my $arg=shift @ARGV or die 'This script requires one argument';

# Get some env vars with sensible defaults if absent
my $lc_all=$ENV{LC_ALL} // '{unset}';
my $lc_ctype=$ENV{LC_CTYPE} // '{unset}';
my $lang=$ENV{LANG} // '{unset}';

# Determine the current Windows code page
my ($active_codepage)=`chcp 2>NUL`=~/: (\d+)/;

# Our environment
say "ENV: LC_ALL=$lc_all LC_CTYPE=$lc_ctype LANG=$lang";
say "Active code page: $active_codepage"; # Note: 65001 is UTF-8

# Saying the wrong thing, expected: 0’s    #### Note: Between the '0' and the 's'
#   is a "right single quotation mark" and should be in utf-8 => 
#   Decimal: 8217 Hex: \x{2019}
# For some strange reason the bytes "\x{2019}" are coming in as "\x{92}" 
#   which is the single-byte CP1252 representation of the character "right 
#   single quotation mark"
# The whole workflow is UTF-8, so I don't know where there is a CP1252 
#   translation of the input argument (outside of Perl that is)

# Display the value of the argument and its length
say "Argument: $arg length: ",length($arg);

# Display the bytes that make up the argument's string
print("Argument hex bytes:");
for my $chr_idx (0 .. length($arg)-1)
{
  print sprintf(' %02x',ord(substr($arg,$chr_idx,1)));
}
say ''; # Newline

I run the Perl script as follows:

V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s

Output:

ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Argument: 0s length: 3
Argument hex bytes: 30 92 73

OK, perhaps we also need to specify UTF-8 for everything (stdin/stdout/stderr and command line args)?

V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s

Output:

ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73

OK, let's try removing all the LC*/LANG env vars completely:

@SET LC_ALL=
@SET LANG=

@REM Proof that everything has been cleared
@REM Note: The caret before the vertical bar escapes it,
@REM       because I have grep set up to run through a
@REM       batch file and need to forward args
@set | grep -iP "LC^|LANG" || echo %errorlevel%

Output:

1

Let's try executing the script again, with UTF-8:

V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s

Output (no change, other than that the LC*/LANG env vars have been cleared):

ENV: LC_ALL={unset} LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73

At this point, I decided to step outside of Perl and see what Windows 10 itself is doing with my command line argument. I have a little C# utility I wrote a while back to help troubleshoot command line argument issues, and I used it to test. The output should be self-explanatory:

V:\videos>ShowArgs 0’s

Filename: |ShowArgs.exe|
Pathname: |c:\bin\ShowArgs.exe|
Work dir:  |V:\videos|

Command line: ShowArgs  0’s

Raw command line characters:

000: |ShowArgs  |: S (083:53) h (104:68) o (111:6F) w (119:77) A (065:41) r (114:72) g (103:67) s (115:73)   (032:20)   (032:20)
010: |0’s       |: 0 (048:30) ’ (8217:2019) s (115:73)

Command line args:

00: |0’s|

This shows several things:

  1. The argument passed in does not need to be quoted (I didn't think it would need to be)
  2. The argument is being correctly passed to the application by Windows as Unicode (the utility sees code point 8217 / U+2019)

I can't for the life of me figure out why Perl is not receiving the argument as UTF-8 at this point.

Of course, as an absolute hack, if I were to append the following to the bottom of my Perl script, the issue would be resolved. But I would like to understand why Perl is not receiving the argument as UTF-8:

# ... Appended to original script shown at top ...
use Encode qw(encode decode);

sub recode 
{ 
  return encode('UTF-8', decode( 'cp1252', $_[0] ));
}

say "\n@{['='x60]}\n"; # Output separator
say "Original arg: $arg";
say "After recoding CP1252 -> UTF-8: ${\recode($arg)}";

Script execution:

V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s

New output:

ENV: LC_ALL=en_US.UTF-8 LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 0030 0092 0073

============================================================

Original arg: 0s
After recoding CP1252 -> UTF-8: 0’s
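
For what it's worth, a slightly simpler variant of the same hack (just a sketch, not what the script above actually does) is to decode the argument once up front and put a UTF-8 encoding layer on STDOUT, instead of re-encoding on every print:

use Encode qw(decode);

# Decode once, then let the output layer handle the encoding.
binmode(STDOUT, ':encoding(UTF-8)');
my $decoded_arg = decode('cp1252', $arg);
say "Decoded arg: $decoded_arg";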

UPDATE

I built a simple C++ test app to get a better handle on what is happening.

Here is the source code:

#include <cstdint>
#include <cstring>
#include <iostream>
#include <iomanip>

int main(int argc, const char *argv[])
{
  if (argc!=2)
  {
    std::cerr << "A single command line argument is required\n";
    return 1;
  }

  const char *arg=argv[1];
  std::size_t arg_len=strlen(arg);

  // Display argument as a string
  std::cout << "Argument: " << arg << " length: " << arg_len << '\n';

  // Display argument bytes
  // Fill with leading zeroes
  auto orig_fill_char=std::cout.fill('0');

  std::cout << "Bytes of argument, in hex:";
  std::cout << std::hex;
  for (std::size_t arg_idx=0; arg_idx<arg_len; ++arg_idx)
  {
    // Note: The cast to uint16_t is necessary because uint8_t is formatted 
    //       "specially" (i.e., still as a char and not as an int)
    //       The cast through uint8_t is necessary due to sign extension of
    //       the original char if going directly to uint16_t and the (signed) char
    //       value is negative.
    //       I could have also masked off the high byte after the cast, with
    //       insertion code like (Note: Parens required due to precedence):
    //         << (static_cast<uint16_t>(arg[arg_idx]) & 0x00ff)
    //       As they say back in Perl-land, "TMTOWTDI!", and in this case it
    //       amounts to the C++ version of Perl "line noise" no matter which
    //       way you slice it. :)
    std::cout << ' ' 
              << std::setw(2) 
              << static_cast<uint16_t>(static_cast<uint8_t>(arg[arg_idx])); 
  }
  std::cout << '\n';

  // Restore the original fill char and go back to decimal mode
  std::cout << std::setfill(orig_fill_char) << std::dec;
}

Built as a 64-bit console application with the MBCS character set setting, the above code was run with:

testapp.exe 0’s

..., and produced the following output:

Argument: 0s length: 3
Bytes of argument, in hex: 30 92 73

So, it is Windows, after all, at least in part. I need to build a UNICODE character set version of this app and see what I get.

Final Update on How to Fix This Once and for All

Thanks to Eryk Sun's comments to ikegami's accepted answer and links in that answer, I have found the best solution, at least with regard to Windows 10. I will now outline the specific steps to follow to force Windows to send command-line args into Perl as UTF-8:

A manifest needs to be added to perl.exe (and to wperl.exe, if you use it) that tells Windows to use UTF-8 as the active code page (ACP) when running the executable. Windows will then pass command line arguments into perl as UTF-8 instead of CP1252.

Changes that Need to be Made

Create the manifest file(s)

Go to the location of your perl.exe (and wperl.exe) and create a file in that (...\bin) directory with the following contents, calling it perl.exe.manifest:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage
        xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
      >UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

If you also want to modify wperl.exe, copy the above perl.exe.manifest to wperl.exe.manifest and edit it, replacing the assemblyIdentity line:

  <assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>

with (notice the change of the value assigned to the name attribute from perl.exe to wperl.exe):

  <assemblyIdentity type="win32" name="wperl.exe" version="6.0.0.0"/>

Embed the Manifests in the Executable(s)

The next step is to embed the manifest file(s) we just created into their respective executable(s). Before doing this, be sure to back up the original executables, just in case!

The manifest(s) can be embedded into the executable(s) as follows:

For perl.exe:

mt.exe -manifest perl.exe.manifest -outputresource:perl.exe;#1

For wperl.exe (optional, needed only if you use wperl.exe):

mt.exe -manifest wperl.exe.manifest -outputresource:wperl.exe;#1

If you don't already have mt.exe, it is included in the Windows 10 SDK, which is available from developer.microsoft.com.
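
With the manifest embedded, you can ask Perl directly which ANSI code page it now sees. This quick check assumes the Win32 module, which ships with the common Windows Perl distributions; it should print 65001 after the change (and 1252, or whatever your system's ACP is, before it):

perl -MWin32 -E "say Win32::GetACP()"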

Rudimentary Testing and Usage

After making the above changes, UTF-8 command line args become super easy!

Take the following script, simple-test.pl:

use strict;
use warnings;
use v5.32; # Or whatever recent version of Perl you have

# Helper subroutine to provide simple hex table output formatting
sub hexdump
{
  my ($arg)=@_;
  sub BYTES_PER_LINE {16}; # Output 16 hex pairs per line

  for my $chr_idx (0 .. length($arg)-1)
  {
    # Break into groups of 16 hex digit pairs per line
    print sprintf("\n  %02x: ", $chr_idx)
      if $chr_idx%BYTES_PER_LINE==0;
    print sprintf('%02x ',ord(substr($arg,$chr_idx,1)));
  }
  say '';
}

# Test app code that makes no mention of Windows, ACPs, or UTF-8 outside
# of stuff that is printed. Other than the call out to chcp to get the
# active code page for informational purposes, it is not particularly tied
# to Windows, either, as long as whatever environment it is run on
# passes the script its arg as UTF-8, of course.
my $arg=shift @ARGV or die 'No argument present';

say "Argument: $arg";
say "Argument byte length: ${\length($arg)} bytes";
print 'Argument UTF-8 data bytes in hex:';
hexdump($arg);

Let's test our script, making sure that we are in the UTF-8 code page (65001):

v:\videos>chcp 65001 && perl.exe simple-test.pl "Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8"

Output (assuming your console font can handle the special chars):

Active code page: 65001
Argument: Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8
Argument byte length: 54 bytes
Argument UTF-8 data bytes in hex:
  00: d0 a0 d0 b0 d0 b1 d0 be d1 82 d0 b0 20 d1 81 20
  10: f0 9d 9f 98 e2 80 99 f0 9d 99 a8 20 76 69 73 2d
  20: c3 a0 2d 76 69 73 20 30 27 73 20 75 73 69 6e 67
  30: 20 55 54 46 2d 38

I hope that my solution will help others that run into this issue.


Solution

  • Every Windows system call that deals with strings comes in two varieties: an "A"NSI version that uses the Active Code Page (aka ANSI Code Page), and a "W"ide version that uses UTF-16LE.[1] Perl uses the A version of all system calls. That includes the call to get the command line.

    The ACP is determined by the system locale (the "Language for non-Unicode programs" setting), not by anything you do in the console. For example, it's 1252 on my system, and there's nothing I can do to change that. Notably, chcp has no effect on the ACP.

    At least, that was the case until recently. The May 2019 update to Windows 10 (version 1903) added the ability to change the ACP on a per-application basis via its manifest. (The documentation indicates that it's possible to change the manifest of an existing application.)

    chcp changes the console's CP, but not the encoding used by the A system calls. Setting it to a code page that contains ’ ensures that you can type in ’, and that Perl can print out a ’ (if properly encoded).[2] Since 65001 contains ’, you have no problems doing those two things.

    The choice of console's CP (as set by chcp) has no effect on how Perl receives the command line. Because Perl uses the A version of the system calls, the command line will be encoded using the ACP regardless of the console's CP and the OEM CP.


    Based on the fact that ’ is encoded as 92, your system appears to use 1252 for its Active Code Page as well. As such, you could resolve your problem as follows:

    use Encode qw( decode );
    
    @ARGV = map { decode("cp1252", $_) } @ARGV;
    

    See this post for a more generic and portable solution which also adds the appropriate encoding/decoding layer to STDIN, STDOUT and STDERR.
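
    For a rough idea of what such a setup looks like, here is a minimal, untested sketch. It assumes the Win32 module is available and that Encode recognizes the cpNNN aliases for the code pages Windows reports (if the console is set to 65001, you may need to map that to UTF-8 yourself):

    use Encode qw( decode );
    use Win32 ();

    # Decode the command line from the ANSI code page (ACP) Windows used for it.
    @ARGV = map { decode('cp'.Win32::GetACP(), $_) } @ARGV;

    # Match the console's input/output code pages on the standard handles.
    binmode(STDIN,  sprintf(':encoding(cp%d)', Win32::GetConsoleCP()));
    binmode(STDOUT, sprintf(':encoding(cp%d)', Win32::GetConsoleOutputCP()));
    binmode(STDERR, sprintf(':encoding(cp%d)', Win32::GetConsoleOutputCP()));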


    But what if you wanted to support arbitrary Unicode characters instead of being limited to those found in your system's ACP? As mentioned above, you could change perl's ACP. Changing it to 65001 (UTF-8) would give you access to the entire Unicode character set.

    Short of doing that, you would need to get the command line from the OS using the W version of the system call and parse it.

    While Perl itself uses the A version of system calls, modules aren't limited to doing the same; they may use the W system calls.[3] So maybe there's a module that does what you need. If not, I've previously written code that does just that.


    Many thanks to @Eryk Sun for the input they provided in the comments.


    • The ACP can be obtained using Win32::GetACP().
    • The OEM CP can be obtained using Win32::GetOEMCP().
    • The console's CP can be obtained using Win32::GetConsoleCP() / Win32::GetConsoleOutputCP().
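
    For example, all four can be dumped in one go from the command line (again assuming the Win32 module is installed):

    perl -MWin32 -E "say for Win32::GetACP(), Win32::GetOEMCP(), Win32::GetConsoleCP(), Win32::GetConsoleOutputCP()"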

    1. SetFileApisToOEM can be used to change the encoding used by some A system calls to the OEM CP.[2]
    2. The console's CP defaults to the system's OEM CP. This can be overridden by changing the CodePage value of the HKCU\Console\<window title> registry key, where <window title> is the initial window title of the console. Of course, it can also be overridden using chcp and the underlying system calls it makes.
    3. Notably, see Win32::LongPath.