Tags: java, eclipse, character-encoding, windows-console

How to read accented letters from terminal in Java?


I have the following Java snippet:

System.out.print("What is the first name of the Hungarian poet Petőfi? ");
String correctAnswer = "Sándor";
Scanner sc = new Scanner(System.in);
String answer = sc.next();
sc.close();
if (correctAnswer.equals(answer)) {
    System.out.println("Correct!");
} else {
    System.out.println("The answer (" + answer + ") is incorrect, the correct answer is " + correctAnswer);
}

This works fine in Eclipse, but not in the Windows terminal: even though I enter the correct answer, Sándor, the comparison fails. This is what it looks like in Eclipse:

What is the first name of the Hungarian poet Petőfi? Sándor
Correct!

The same from the command line:

What is the first name of the Hungarian poet Petőfi? Sándor
The answer (S?ndor) is incorrect, the correct answer is Sándor

I tried the following without success:

  • CHCP 65001 (to change the code page to UTF-8): this only fixes the display of the word Petőfi when it is shown incorrectly; it does not help with the input.
  • [Console]::InputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
  • Passing StandardCharsets.UTF_8 or "UTF-8" to Scanner.
  • Using InputStreamReader (with and without passing the encoding) instead of Scanner.
  • Passing the -Dfile.encoding=UTF-8 command line parameter.
  • Adding the line System.setProperty("file.encoding", "UTF-8");
  • Using PowerShell instead of cmd.
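Since the attempts above each change a different setting, it can help to print what the JVM has actually picked up before and after each change. The following is only a minimal diagnostic sketch, assuming Java 18 or later (System.out.charset() needs it). The class name EncodingReport and the method collect are made up for illustration, and the native.encoding and stdout.encoding properties may simply print null on JDKs that do not define them.

```java
import java.io.Console;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingReport {

    // Collects the encoding-related settings the JVM reports for this process.
    // (Hypothetical helper, not part of the original question.)
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        report.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        report.put("native.encoding", String.valueOf(System.getProperty("native.encoding")));
        report.put("stdout.encoding", String.valueOf(System.getProperty("stdout.encoding")));
        report.put("System.out.charset()", System.out.charset().name());
        Console console = System.console();
        report.put("System.console()",
                console == null ? "null (no console attached)" : console.charset().name());
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it once in the terminal and once in Eclipse should show whether the default charset and the charset of System.out differ between the two environments.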

I double-checked: the encoding of the Java source file is UTF-8.

When I convert the input to bytes with Arrays.toString(input.getBytes()), I observe the following:

  • By default, the non-accented letters get their normal ASCII codes. Each accented letter takes 3 bytes, all with negative values. Before the code page change, the bytes of Sándor are: [83, -17, -65, -67, 110, 100, 111, 114].
  • When the code page is changed to UTF-8, the non-accented letters stay the same, while each accented letter becomes a single byte with value 0. After the code page change, the bytes are: [83, 0, 110, 100, 111, 114].
  • The same in Eclipse: [83, -61, -95, 110, 100, 111, 114]
  • The String Sándor is actually encoded in Java like this: [83, -61, -95, 110, 100, 111, 114]
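One possible reading of these byte dumps (an assumption on my part, the dumps alone do not prove it): the console sends á as a single byte in its own code page, and the JVM decodes that byte as UTF-8, which is the default charset since Java 18. A lone high byte is malformed UTF-8, so it is replaced by U+FFFD, and U+FFFD encoded back to UTF-8 is exactly [-17, -65, -67]. The sketch below simulates this, assuming code page 852 (IBM852 in Java) on the console side; MojibakeDemo and roundTrip are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MojibakeDemo {

    // Encodes text the way the console would send it, then decodes it
    // the way the JVM would read it. (Hypothetical helper.)
    static String roundTrip(String text, Charset consoleCharset, Charset jvmCharset) {
        byte[] wire = text.getBytes(consoleCharset);
        return new String(wire, jvmCharset);
    }

    public static void main(String[] args) {
        // "Sándor", written with an escape so the source file encoding does not matter
        String typed = "S\u00e1ndor";
        // Assumed console code page; requires a JDK that ships the extended charsets
        Charset cp852 = Charset.forName("IBM852");

        // Console bytes decoded with the wrong charset: á turns into U+FFFD
        String misread = roundTrip(typed, cp852, StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(misread.getBytes(StandardCharsets.UTF_8)));
        // prints [83, -17, -65, -67, 110, 100, 111, 114]

        // Console bytes decoded with the matching charset: the text survives
        String matched = roundTrip(typed, cp852, cp852);
        System.out.println(matched.equals(typed)); // prints true
    }
}
```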

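So, narrowing it down to the letter á, we have the following:

  • Windows terminal, default code page: [-17, -65, -67]
  • Windows terminal, code page changed to UTF-8: [0]
  • Eclipse: [-61, -95]
  • Java String: [-61, -95]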

It works in Git Bash, but the letter ő (and every other accented character, not just this one) is displayed incorrectly in that terminal:

What is the first name of the Hungarian poet Pet▒fi? Sándor
Correct!

Strangely, although the comparison works and the accented characters look fine as I type them, echoing them back does not work:

What is the first name of the Hungarian poet Pet▒fi? Péter
The answer (P▒ter) is incorrect, the correct answer is S▒ndor

The following helped in the Windows terminal:

Console console = System.console();
String answer = console.readLine();

But this does not work in Eclipse:

What is the first name of the Hungarian poet Petőfi? Sándor
The answer (Sándor) is incorrect, the correct answer is Sándor

UPDATE: it seems to depend on the system settings. I have 2 laptops, one with Hungarian and the other with English settings.

  • On the Hungarian one, System.console() works in the terminal and new Scanner(System.in) works in Eclipse. However, System.console() works incorrectly in Eclipse, even if I change the encoding in Window -> Preferences -> General -> Workspace.
  • On the English one, I did not find any working option in the terminal: either the output (like the letter ő) is incorrect or the comparison fails. In Eclipse, the System.console() approach throws a NullPointerException because the console is null. (The Eclipse version there is 2023-12; I did not try the latest one.)
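The NullPointerException at least can be avoided by checking the result of System.console() before using it. The following is a rough sketch, not a fix for the encoding problem itself: it uses the console when one is attached and otherwise falls back to reading System.in with the default charset, which is the combination that worked in Eclipse above. LineInput, readLine and readAnswer are names invented for this example.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class LineInput {

    // Reads one line from the stream, decoding it with the given charset.
    static String readLine(InputStream in, Charset charset) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Prefers the console when one is attached; otherwise falls back to
    // System.in with the default charset (an assumption, see the text above).
    static String readAnswer() {
        Console console = System.console();
        if (console != null) {
            return console.readLine();
        }
        return readLine(System.in, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.print("Your answer: ");
        System.out.println("You typed: " + readAnswer());
    }
}
```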

My Java version is 22.0.2, but the problem does not seem version-specific.

As a cross-check, I tried the same in Python, and it works in both the Windows terminal and the IDE without any problem:

answer = input('What is the first name of the Hungarian poet Petőfi? ')
correct_answer = 'Sándor'
if answer == correct_answer:
    print('Correct')
else:
    print('The answer (' + answer + ') is incorrect, the correct answer is ' + correct_answer)

So my question is: how can I make it work? Is there a universal solution that works in both the Windows terminal and Eclipse?


Solution

  • Scanner scanner = new Scanner(System.in, System.out.charset());
    

    This solution requires Java 18 or later, because that is where System.out.charset() was added. It works both in Eclipse with default settings and in the Windows command prompt with code page 852. To check the current code page:

    chcp
    

    Changing it to 852:

    chcp 852
    

    Thanks to everyone who helped reach the solution!
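For completeness, here is how the one-line solution might slot into the original snippet. This is a sketch under the same assumptions (Java 18+, console code page 852 in the Windows command prompt). The class PoetQuiz and its helper methods are illustrative, and the accented letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Scanner;

public class PoetQuiz {

    static final String CORRECT_ANSWER = "S\u00e1ndor"; // "Sándor"

    // Reads the first token from the stream, decoding it with the given charset.
    static String readAnswer(InputStream in, Charset charset) {
        Scanner scanner = new Scanner(in, charset);
        return scanner.next();
    }

    static boolean isCorrect(String answer) {
        return CORRECT_ANSWER.equals(answer);
    }

    public static void main(String[] args) {
        // "Petőfi", with ő written as \u0151
        System.out.print("What is the first name of the Hungarian poet Pet\u0151fi? ");
        // Decode the input with the same charset that System.out uses (Java 18+)
        String answer = readAnswer(System.in, System.out.charset());
        if (isCorrect(answer)) {
            System.out.println("Correct!");
        } else {
            System.out.println("The answer (" + answer
                    + ") is incorrect, the correct answer is " + CORRECT_ANSWER);
        }
    }
}
```

Passing the stream and charset into readAnswer keeps the decoding step separate from System.in, so it can be checked against a byte array encoded in a known code page.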