windows powershell character-encoding mosquitto powershell-core

On Windows, PowerShell misinterprets non-ASCII characters in mosquitto_sub output

Note: This self-answered question describes a problem that is specific to using Eclipse Mosquitto on Windows; there, however, it affects both Windows PowerShell and the cross-platform PowerShell (Core) edition.

I use something like the following mosquitto_pub command to publish a message:

mosquitto_pub -h test.mosquitto.org -t tofol/test -m '{ \"label\": \"eé\" }'

^{Note: The extra \-escaping of the " characters, still required as of PowerShell 7.1, shouldn't be necessary, but that is a separate problem - see this answer.}

Receiving that message via mosquitto_sub unexpectedly mangles the non-ASCII character é and prints Θ instead:

PS> $msg = mosquitto_sub -h test.mosquitto.org -t tofol/test; $msg

{ "label": "eΘ" }  # !! Note the 'Θ' instead of 'é'

Why does this happen?
How do I fix the problem?

Solution

Problem:

While the mosquitto_sub man page makes no mention of character encoding as of this writing, it seems that on Windows mosquitto_sub exhibits nonstandard behavior in that it uses the system's active ANSI code page to encode its string output rather than the OEM code page that console applications are expected to use.^[1]

There also appears to be no option that would allow you to specify what encoding to use.

PowerShell decodes output from external applications into .NET strings, based on the encoding stored in [Console]::OutputEncoding, which defaults to the OEM code page. Therefore, when it sees the ANSI byte representation of character é, 0xe9, in the output, it interprets it as the OEM representation, where it represents character Θ (the assumption is that the active ANSI code page is Windows-1252, and the active OEM code page IBM437, as is the case in US-English systems, for instance).

You can verify this as follows:

# 0xe9 is "é" in the (Windows-1252) ANSI code page, and coincides with *Unicode* code point
# U+00E9; in the (IBM437) OEM code page, 0xe9 represents "Θ".
PS> $oemEnc = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage OEMCP)); 
    $oemEnc.GetString([byte[]] 0xe9)

Θ   # Greek capital letter theta

Note that the decoding to .NET strings (System.String) that invariably happens means that the characters are stored as UTF-16 code units in memory, essentially as [uint16] values underlying the System.Char instances that make up a .NET string. Such a code unit encodes a Unicode character either in full, or - for characters outside the so-called BMP (Basic Multilingual Plane) - half of a Unicode character, as part of a so-called surrogate pair.
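The UTF-16 storage described above can be observed directly. The following is a minimal sketch; the sample characters (U+00E9 and U+1F600) and the variable names are chosen purely for illustration, and the characters are built from their code points so that the snippet does not depend on the encoding of the script file:

```powershell
# A BMP character such as U+00E9 occupies a single UTF-16 code unit.
$bmp = [string] [char] 0xE9
$bmp.Length                                   # 1

# U+1F600 lies outside the BMP, so it is stored as two code units (a surrogate pair).
$nonBmp = [char]::ConvertFromUtf32(0x1F600)
$nonBmp.Length                                # 2
[char]::IsSurrogatePair($nonBmp, 0)           # True

# The individual code units, formatted as hex numbers.
$codeUnits = foreach ($c in $nonBmp.ToCharArray()) { '0x{0:X4}' -f [int] $c }
$codeUnits -join ' '                          # 0xD83D 0xDE00
```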

In the case at hand, this means that the misinterpreted byte 0xe9 ends up in the string as the Unicode code point of Θ (Greek capital letter theta, U+0398) rather than that of é (U+00E9).
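The whole misdecoding can be reproduced without Mosquitto. The sketch below hard-codes code pages 1252 (ANSI) and 437 (OEM), which is an assumption that matches US-English systems; on other systems, substitute the ACP and OEMCP values from the registry. The variable names are illustrative only:

```powershell
# Assumption: ANSI code page 1252 and OEM code page 437, as on US-English systems.
$ansiEnc = [System.Text.Encoding]::GetEncoding(1252)
$oemEnc  = [System.Text.Encoding]::GetEncoding(437)

# Build the original character from its code point (U+00E9).
$original = [string] [char] 0xE9

# Encode as ANSI, which is what mosquitto_sub appears to do on Windows...
$bytes = $ansiEnc.GetBytes($original)
'0x{0:X2}' -f $bytes[0]                       # 0xE9

# ... and decode as OEM, which is what PowerShell does by default.
$decoded = $oemEnc.GetString($bytes)
'U+{0:X4}' -f [int] $decoded[0]               # U+0398
```

The single byte is the same in both steps; only the code page used to interpret it differs.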

Solution:

Note: A simple way to solve the problem is to activate system-wide support for UTF-8 (available in Windows 10), which sets both the ANSI and the OEM code page to 65001, i.e. UTF-8. However, this feature (a) is still in beta as of this writing and (b) has far-reaching consequences - see this answer for details.
Still, it amounts to the most fundamental solution, as it also makes cross-platform Mosquitto use work properly (on Unix-like platforms, Mosquitto uses UTF-8).

PowerShell must be instructed what character encoding to use in this case, which can be done as follows:

PS> $msg = & { 
      # Save the original console output encoding...
      $prevEnc = [Console]::OutputEncoding
      # ... and (temporarily) set it to the active ANSI code page.
      # Note: In *Windows PowerShell* - only - [System.Text.Encoding]::Default works as the RHS too.
      [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP))

      # Now PowerShell will decode mosquitto_sub's output correctly.
      mosquitto_sub -h test.mosquitto.org -t tofol/test

      # Restore the original encoding.
      [Console]::OutputEncoding = $prevEnc
    }; $msg

{ "label": "eé" }  # OK

^{Note: The Get-ItemPropertyValue cmdlet requires PowerShell version 5 or higher; in earlier versions, either use [Console]::OutputEncoding = [System.Text.Encoding]::Default or, if the code must also run in PowerShell (Core), [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP).ACP)}
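One possible refinement of the commands above: if the external program is interrupted with Ctrl-C or a statement throws a terminating error, the restore step never runs and the session keeps the ANSI encoding. A try / finally block avoids that. The following is only a sketch; the function name Invoke-WithAnsiOutputEncoding is hypothetical, made up for this illustration, and it reads the registry, so it is Windows-only:

```powershell
# Hypothetical helper (name invented for illustration): runs a script block
# with [Console]::OutputEncoding temporarily set to the active ANSI code page.
function Invoke-WithAnsiOutputEncoding {
  param(
    [Parameter(Mandatory)]
    [scriptblock] $ScriptBlock
  )
  # Save the original console output encoding.
  $prevEnc = [Console]::OutputEncoding
  # Look up the active ANSI code page (Get-ItemPropertyValue requires v5+).
  $acp = [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
  try {
    [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding($acp)
    # Output from external programs called inside the block is now decoded as ANSI.
    & $ScriptBlock
  }
  finally {
    # Runs even on Ctrl-C or a terminating error, so the encoding is always restored.
    [Console]::OutputEncoding = $prevEnc
  }
}
```

Under these assumptions, usage would look like this: Invoke-WithAnsiOutputEncoding { mosquitto_sub -h test.mosquitto.org -t tofol/test }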

Helper function Invoke-WithEncoding can encapsulate this process for you. You can install it directly from a Gist as follows (I can assure you that doing so is safe, but you should always check):

# Download and define advanced function Invoke-WithEncoding in the current session.
irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex

The workaround then simplifies to:

PS> Invoke-WithEncoding -Encoding Ansi { mosquitto_sub -h test.mosquitto.org -t tofol/test }

{ "label": "eé" }  # OK

A similar function focused on diagnostic output is Debug-NativeInOutput, discussed in this answer.

As an aside:

While PowerShell isn't the problem here, it too can exhibit problematic character-encoding behavior.

GitHub issue #7233 proposes making PowerShell (Core) console windows default to UTF-8 to minimize encoding problems with most modern command-line programs (it wouldn't help with mosquitto_sub, however), and this comment fleshes out the proposal.

^{[1] Note that Python too exhibits this nonstandard behavior, but it offers UTF-8 encoding as an opt-in, either by setting environment variable PYTHONUTF8 to 1, or via the v3.7+ CLI option -X utf8 (must be specified case-exactly!).}