
Groovy Cyrillic characters output problem


I have a problem with output in a Groovy script. For example, this code:

def rusAlphabet = 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ'
def lowerCaseRusAlphabet = 'абвгдеёжзийклмнопрстуфхцшщъыьэюя'

println(rusAlphabet)
println(rusAlphabet.toLowerCase())
println(lowerCaseRusAlphabet)

prints:

AБВГДЕ?ЖЗИЙКЛМ?ОПРСТУФХЦЧШЩЪЫЬЭЮЯ
a??
абвгдеёжзийклмнопр?туфхцшщъыь?ю?

It works fine with Python scripts. I work on Windows 10 x64.

In CMD and PowerShell, Cyrillic characters were initially displayed as question marks. Then I checked "Beta: Use Unicode UTF-8 for worldwide language support" in the region administrative settings. Now characters are displayed correctly there, but not in the output of Groovy scripts.

I tried adding this code to the script:

try {
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    throw new InternalError("VM does not support mandatory encoding UTF-8");
}

It prints:

AБВГДЕ�ЖЗИЙКЛМ�ОПРСТУФХЦЧШЩЪЫЬЭЮЯ
að‘ð’ð“ð”ð•ð�ð–ð—ð˜ð™ðšð›ðœð�ðžðÿð ð¡ð¢ð£ð¤ð¥ð¦ð§ð¨ð©ðªð«ð¬ð­ð®ð¯
абвгдеёжзийклмнопр�туфхцшщъыь�ю�

Solution

    • I would have expected your activation of system-wide support for UTF-8 (Windows code page 65001) to solve your problem, because it sets both the OEM and the ANSI code page to 65001, which should make all legacy (non-Unicode) programs "speak UTF-8".

      • Note that activating this feature - while convenient - has far-reaching consequences and can break legacy code: see this answer for background information.

      • If you do not use this feature, the following is required in addition to ensuring that source code is read as UTF-8 (see next major point):

        • As shown in this answer mentioned in the comments, you must switch stdout and stderr (the standard output and standard error streams) to UTF-8:[1]

          System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
          System.setErr(new PrintStream(new FileOutputStream(FileDescriptor.err), true, "UTF-8"));
          
        • You also need to execute the following to make a PowerShell session use UTF-8 consistently (see this answer for background information):

          $OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
          
    • Your problem implies that Groovy doesn't interpret your source code file (script file) as UTF-8 but rather as Windows-1252, which is the ANSI code page for the US-English locale as well as many European ones.

      • Groovy, perhaps needless to say, is based on Java, and Java versions 17 and below use the system's ANSI code page to interpret source code files, whereas v18+ commendably uses UTF-8. As such, with the ANSI code page being 65001, i.e. UTF-8, this shouldn't be a problem - but perhaps Java determines what the active ANSI code page is differently.

      • However, irrespective of whether you've activated system-wide UTF-8 support, you can explicitly instruct Groovy / Java to interpret source code as UTF-8, as follows:

        • groovy `-Dfile.encoding=UTF8 <your-Groovy-script>

          • Note the ` before -, which is only necessary when calling from PowerShell, due to an unfortunate bug - see GitHub issue #6291.
        • Alternatively, you can preset this option via the JAVA_TOOL_OPTIONS environment variable; e.g., from PowerShell, for the current process:
          $env:JAVA_TOOL_OPTIONS = '-Dfile.encoding=UTF8'
          Note that the JVM will then print a "Picked up JAVA_TOOL_OPTIONS" message when Groovy starts, indicating use of the environment variable.
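
To make the misinterpretation described above concrete, here is a minimal Java sketch of what probably happens when UTF-8 source bytes are decoded as Windows-1252. The class and method names (MojibakeDemo, misread, roundTrip, hex) are made up for illustration, and the sketch assumes that windows-1252 is the ANSI code page in play and that the JDK provides that charset. String literals use \u escapes so the demo itself does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class MojibakeDemo {
    // Assumption: windows-1252 is the ANSI code page the source was decoded with.
    static final Charset ANSI = Charset.forName("windows-1252");

    // The string the compiler ends up with when the UTF-8 bytes of a
    // literal are decoded as windows-1252 instead of UTF-8.
    static String misread(String literal) {
        return new String(literal.getBytes(StandardCharsets.UTF_8), ANSI);
    }

    // What a UTF-8 console shows when the misread string is written back
    // out as windows-1252 bytes.
    static String roundTrip(String literal) {
        byte[] written = misread(literal).getBytes(ANSI);
        return new String(written, StandardCharsets.UTF_8);
    }

    // Renders code points as "U+XXXX" so the output is console-independent.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String be = "\u0411"; // Cyrillic capital BE
        String yo = "\u0401"; // Cyrillic capital IO

        // Shows whether -Dfile.encoding (or the OS setting) took effect.
        System.out.println("default charset: " + Charset.defaultCharset());

        // BE is the UTF-8 bytes D0 91; both bytes are defined in
        // windows-1252, so the misread text round-trips unchanged.
        System.out.println("misread BE:    " + hex(misread(be)));
        System.out.println("round trip BE: " + hex(roundTrip(be)));

        // Lowercasing acts on the misread characters, not on the Cyrillic letter.
        System.out.println("lowercased BE: "
                + hex(misread(be).toLowerCase(Locale.ROOT)));

        // IO is D0 81; byte 0x81 has no windows-1252 mapping, so it is lost.
        System.out.println("round trip IO: " + hex(roundTrip(yo)));
    }
}
```

If this model is right, it would explain the output in the question: most letters survive because their bytes pass through windows-1252 unchanged, letters whose second UTF-8 byte is undefined in windows-1252 (such as Ё and Н) turn into ?, and toLowerCase() produces garbage because it lowercases Ð to ð rather than the Cyrillic letter.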


    [1] Note: I'm unclear on how to also switch stdin (the standard input stream) to UTF-8 for text-based operations; do tell us if you know.
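
Regarding the stdin question in the footnote, one possible approach is to wrap System.in in a reader with an explicit charset instead of relying on the platform default. The sketch below is an untested assumption for the Windows console case: it only helps if the console or pipe actually delivers UTF-8 bytes, which the [Console]::InputEncoding setting above is meant to ensure. The class and method names (Utf8Stdin, readLines) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Utf8Stdin {
    // Reads all lines from the stream, decoding the bytes explicitly as
    // UTF-8 rather than with the JVM's default charset.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Report lengths in UTF-16 code units instead of echoing the text,
        // so the result does not depend on the stdout encoding.
        for (String line : readLines(System.in)) {
            System.out.println(line.length() + " chars");
        }
    }
}
```

The same wrapping should work inside a Groovy script, since Groovy can use these Java classes directly.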