Tags: java, string, unicode-normalization

Why does my text normalization behave differently in different environments?


I am normalizing some accented text using the following code, taken from this answer:

Accent removal:

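// Requires: import java.text.Normalizer;
String accented = "árvíztűrő tükörfúrógép";
// NFD splits each accented letter into a base letter plus combining marks
String normalized = Normalizer.normalize(accented, Normalizer.Form.NFD);
// Dropping everything outside ASCII removes the combining marks
normalized = normalized.replaceAll("[^\\p{ASCII}]", "");
System.out.println(normalized);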

When I run this from within IntelliJ (as part of a unit test), it gives the expected result:

arvizturo tukorfurogep

If I run it from the command line (via Gradle), I get:

ArvAztArA tAkArfArAgA

In both cases, I'm using the same PC and Java 1.8.0_151.

The relevant parts from build.gradle:

apply plugin: 'java'
apply plugin: 'idea'
sourceCompatibility = 1.8
targetCompatibility = 1.8
dependencies {
  testCompile group: 'junit', name: 'junit', version: '4.12'
}

What causes this different behaviour? And how do I ensure I get the expected result everywhere?


Solution

  • Thanks to @eckes and others for the compile-time suggestion. By specifying the source encoding at compile time, I was able to get the desired result.

    The setting I added to build.gradle was:

    compileTestJava.options.encoding = 'UTF-8'
    

    This option only affects the test classes (which is where my issue was). You can also use:

    compileJava.options.encoding = 'UTF-8'
    

    if your production sources also contain non-ASCII text.

    An alternative solution I came across is:

    tasks.withType(JavaCompile) {
      options.encoding = 'UTF-8'
    }
    

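    (Interestingly, none of the above solutions changed the value of the file.encoding system property. That fits with what the setting does: options.encoding is passed to javac as -encoding and only controls how source files are decoded, not the default charset of the JVM that runs the tests.)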
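
To illustrate why the source encoding matters, here is a minimal sketch that reproduces the effect at runtime. It assumes the source file was saved as UTF-8 and that the Gradle build decoded it with a single-byte charset; ISO-8859-1 is used here as a stand-in (windows-1252 gives the same result for this string), since the question does not state which default was actually in effect. The class name EncodingMismatchDemo and the helper misread are hypothetical names chosen for this example. The string literal is written with Unicode escapes so that the demo itself does not depend on how the file is compiled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class EncodingMismatchDemo {

    // Same accent-removal steps as in the question.
    public static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    // Simulates a compiler reading UTF-8 source bytes with the wrong charset
    // (hypothetical helper, for illustration only).
    public static String misread(String text, Charset assumedByCompiler) {
        byte[] bytesOnDisk = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytesOnDisk, assumedByCompiler);
    }

    public static void main(String[] args) {
        // "árvíztűrő tükörfúrógép" written with Unicode escapes,
        // so this literal is immune to the source-encoding setting.
        String accented =
            "\u00e1rv\u00edzt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p";

        // Literal decoded correctly: accents are stripped as intended.
        System.out.println(stripAccents(accented));

        // Literal decoded with the wrong charset: each accented letter becomes
        // two characters, and only a bare "A" survives the ASCII filter.
        System.out.println(stripAccents(misread(accented, StandardCharsets.ISO_8859_1)));

        // Runtime default, which the Gradle options above do not change.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

In the misread case every accented letter collapses to a capital "A", the same pattern as in the Gradle output quoted in the question. Each two-byte UTF-8 sequence is decoded as "Ã" or "Å" followed by a second non-ASCII character; NFD reduces the first to "A" plus a combining mark, and the ASCII filter removes the rest. Writing literals with Unicode escapes, as done here, is another way to make test data independent of the compiler's encoding setting, at some cost in readability.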