Why does R treat non-ASCII characters differently depending on the SSH client's OS?


When run over SSH, R appears to treat non-ASCII characters differently depending on the OS of the SSH client.

For example, if I use a computer running macOS (14.6.1) to start an R session on an Ubuntu machine (22.04.5), and run:

units::set_units(12.7, "\U00B5m")

I get:

12.7 [µm]

But the same expression, run on the same server from a Windows client (10.0.19045.4170), produces:

Error: In '<U+00B5>m', '<U+00B5>m' is not recognized by udunits.

I thought that this could have to do with how the command line on each OS sends the character representations over SSH. However, if I save the following script on the server (written using vim over SSH from the macOS machine):

#!/bin/Rscript

print(nchar("µm"))

And execute it over SSH from the macOS client (e.g., ssh <user>@<host> "./print_micron.R"), I get:

[1] 2

i.e., "µ" is a single two-byte character. But if I execute it from the Windows client, I get:

[1] 3

i.e., "µ" becomes two separate characters, one for each byte.

This is challenging my intuition of how executing commands on SSH works, as I would expect the behavior of R to be determined entirely by the server. Why would the client OS affect how non-ASCII characters are represented by R?


Solution

  • Your Mac probably has LANG=en_US.UTF-8 (or something similar) in its environment, which sets the default locale to one that uses the UTF-8 encoding. It probably also has SSH configured to forward that environment variable to the server (SendEnv LANG in the SSH config), and the server is configured to accept it (AcceptEnv in its sshd_config). That causes R to use UTF-8 as its native encoding, including when it reads source files.

    Your Windows SSH client, on the other hand, is likely not sending any such variable, and nothing on the server is defaulting it, so you get the C locale, which is ASCII-only. That causes units to not know what character B5 means (there are no characters above 7F in ASCII!), and causes the string literal in your test script to be interpreted as three characters (one per byte) instead of two.

    You should be able to see the difference by running sessionInfo() and l10n_info() from the two different clients: they will show different values for locale, codeset, and UTF-8.
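The byte-versus-character effect can also be reproduced outside R. Below is a minimal shell sketch, assuming a POSIX shell with `wc` and `locale` available on the server. The helper name `micron` is made up for illustration, and the UTF-8 part is skipped if no UTF-8 locale is installed:

```shell
#!/bin/sh
# "µm" written as raw UTF-8 bytes (C2 B5 6D), so the demo does not depend
# on the encoding of this script file. `micron` is a hypothetical helper.
micron() { printf '\302\265m'; }

# Byte count: independent of the locale.
bytes=$(micron | wc -c | tr -d '[:space:]')

# Character count in the C locale: each byte is counted as its own character.
chars_c=$(micron | LC_ALL=C wc -m | tr -d '[:space:]')

# Pick any UTF-8 locale installed on this machine (assumption: something
# like C.UTF-8 or en_US.UTF-8 exists; if not, the last line is not printed).
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8' | head -n 1)

echo "bytes:            $bytes"
echo "chars (C locale): $chars_c"
if [ -n "$utf8_locale" ]; then
    chars_utf8=$(micron | LC_ALL="$utf8_locale" wc -m | tr -d '[:space:]')
    echo "chars ($utf8_locale): $chars_utf8"
fi
```

With the `LC_ALL=...` overrides removed, `wc -m` follows whatever locale the SSH session inherited, which is the same dependence that nchar() shows in R.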

    If your Windows SSH client can handle UTF-8, you should be able to do either of the following:

    1. Add LANG to the list of environment variables it sends to the server, if it has a config for that.
    2. Add export LANG=en_US.UTF-8 (or whatever value is appropriate for you) to your ~/.profile or ~/.bashrc on the server (or whatever you've got that gets automatically sourced when you log in).
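Option 2 can be sketched as a guarded default, so that a locale forwarded by a client (as the Mac appears to do) still takes precedence. The function name `ensure_utf8_lang` is hypothetical, and en_US.UTF-8 is only an example value; it should be a locale that `locale -a` lists on the server:

```shell
# Sketch for a shell startup file on the server.
# Hypothetical helper: set a UTF-8 default only when the client forwarded
# neither LANG nor LC_ALL.
ensure_utf8_lang() {
    if [ -z "${LC_ALL:-}" ] && [ -z "${LANG:-}" ]; then
        LANG=en_US.UTF-8   # example value, adjust to an installed locale
        export LANG
    fi
}

ensure_utf8_lang
```

One caveat, stated as an assumption about a typical bash setup: a command run as ssh <user>@<host> "./print_micron.R" does not start a login shell, so ~/.profile is normally not read in that case. With bash, ~/.bashrc usually is, but the snippet then has to sit above any early return for non-interactive shells (Ubuntu's default ~/.bashrc has one).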