
Maven change encoding to certain files


All of my project uses Cp1252 encoding, except for a couple of files that I have encoded in UTF-8 because they contain special characters.

When I run an install, I get a couple of errors in those files: unclosed character literal and illegal character: '\u00a8'. When I run the install with the compiler plugin's encoding set to UTF-8:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <version>2.3.2</version>
    <configuration>
        <source>1.8</source>
        <target>1.8</target>
        <encoding>UTF-8</encoding>
    </configuration>
</plugin>

The errors are no longer displayed in the above-mentioned files, but many others now show: unmappable character for encoding UTF-8.

Can I specify UTF-8 encoding only for some files?


Another thing: Maven displays errors as follows:

folder/file.java:[10,19] unclosed character literal
folder/file.java:[10,22] unclosed character literal
folder/file.java:[13,19] unclosed character literal

What do the numbers mean? They do not seem to be the line numbers where the errors are located.


Solution

  • [10,19] means: The 19th character on the 10th line.

    @VGR explained precisely why reading UTF-8-encoded source files as Cp1252 causes compilation to fail: any non-ASCII character is encoded as at least 2 bytes in UTF-8. If you then incorrectly read those bytes as Cp1252, you get 2 or more gobbledygook characters. Given that char literals allow only 1 character inside, the code now has compiler errors in it.
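To see concretely what the compiler sees, here is a small sketch (class name is mine) that encodes a character as UTF-8 and then decodes the same bytes as Cp1252:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // 'é' takes 2 bytes in UTF-8: 0xC3 0xA9
        byte[] utf8Bytes = "é".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8Bytes.length); // 2

        // Decoding those same 2 bytes as Cp1252 yields TWO characters
        String misread = new String(utf8Bytes, Charset.forName("windows-1252"));
        System.out.println(misread);          // Ã©
        System.out.println(misread.length()); // 2
    }
}
```

The single character é comes back as the two characters Ã©; a char literal containing it now holds two characters, which is exactly where errors like unclosed character literal come from.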

    There's no way to tell Maven that some files are UTF-8 and some files are Cp1252 unless you run separate compilation passes. That is hard to do, would be very confusing and hard to maintain (so, a bad idea), and can't work at all unless you either involve stubs or you're 'lucky' and one of the two batches is 'self-contained' (contains absolutely no reference to anything from the other batch).

    So let's discard that as a feasible option. That leaves two options:

    The right option - all UTF-8, all the time

    Treat all source files as UTF-8. This is easier than it sounds: all ASCII characters are encoded identically in UTF-8 and Cp1252, so only non-ASCII characters need to be reviewed. These are easy to find: effectively, any byte above 127. You can use many tools to find them. Here is an SO question with answers about how to do this on Linux, for example.
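If you'd rather stay in Java than use shell tools, a quick sketch along these lines (the class name and the `src` default are my own invention) lists the files that contain any byte above 127:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class NonAsciiFinder {
    public static void main(String[] args) throws IOException {
        // Walk the source tree (defaulting to "src") and print every
        // .java file that contains at least one non-ASCII byte.
        Path root = Paths.get(args.length > 0 ? args[0] : "src");
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(p -> p.toString().endsWith(".java"))
                 .filter(NonAsciiFinder::hasNonAscii)
                 .forEach(System.out::println);
        }
    }

    public static boolean hasNonAscii(Path file) {
        try {
            for (byte b : Files.readAllBytes(file)) {
                if ((b & 0xFF) > 0x7F) return true; // byte above 127 = non-ASCII
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Run it as, say, `java NonAsciiFinder src/main/java`; only the files it prints need their encoding reviewed.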

    Open those files with any editor that makes it clear which encoding it is using (most developer editors do this right), reload with different encodings until the special characters look correct, then re-save as UTF-8. Voila. All the files with no special characters are valid UTF-8 and Cp1252 at the same time - you can simply compile them using UTF-8 encoding and it'll just work fine.

    Now all your code is in UTF-8. Configure your IDE project accordingly / just leave your Maven POM on 'it is UTF-8' and all Maven-aware project tools will pick up on this automatically.
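The usual way to do that in the POM is the standard `project.build.sourceEncoding` property, which the compiler and resources plugins both pick up, so you don't need a per-plugin `<encoding>` entry:

```xml
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
```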

    Considerably worse option - backslash-u escaping

    If you can't do that because some tool reads those source files (not Maven or javac - in fact pretty much nothing major from the Java ecosystem, which is all quite UTF-8 aware) and insists on parsing them as Cp1252, there is still a way to remove all non-ASCII characters from source files: backslash-u escapes.

    The construct \u0123 is legal anywhere in any Java file, not just in string literals. It means: the Unicode character with that value (in hex). For example, this:

    class Test {
      public static void main(String[] args) {
        //This does nothing, right? \u000aSystem.out.println("Hello!");
      }
    }
    

    When you run it, actually prints Hello!. Even though the sysout is in a comment... or is it?

    \u000a is the newline character. So the above file is parsed as a comment on one line, then a newline, so that System.out statement really is in there and isn't in a comment. Many tools don't know this (e.g. Sublime Text and co. will render that sysout statement in comment green), but javac and, in fact, the Java Language Specification are crystal clear on this: the above code has a real print statement in there, not commented out.

    Thus, you can hunt down all non-ASCII characters and replace them with \u escapes, and now your code is hybridized: it parses identically regardless of which encoding you use, as long as it's an ASCII-compatible encoding - and almost all encodings are. (Only a few Japanese and other East Asian charsets, as well as UTF-16/UCS-2/UCS-4/UTF-32 style encodings, are not ASCII-compatible. Cp1252, ISO-8859, UTF-8 itself, ASCII itself, Cp850, and many many others are 'ASCII compatible', meaning 100% ASCII text is identically encoded by all of them.)

    To turn characters into \u escapes, look up the hexadecimal value of the symbol on any Unicode website and apply it. For example, é becomes \u00E9 and ☃ becomes \u2603 (the Unicode snowman).
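If there are many such characters, a small helper can generate the escapes for you. This is my own sketch (it handles BMP characters only; supplementary characters would need surrogate-pair handling):

```java
public class ToUnicodeEscapes {
    // Replaces every char above 127 with its backslash-u escape,
    // leaving ASCII characters untouched.
    public static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("mêléeAttack"));
    }
}
```

For instance, mêléeAttack becomes m\u00EAl\u00E9eAttack.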

    Put those escapes in wherever you see non-ASCII characters in a source file, even outside of string literals:

    This is legal Java:

    public class Fighter {
      public void mêléeAttack() {}
    }
    

    But if you mix up the encoding setting in your editor and the encoding setting in Maven, that goes badly. However, this:

    public class Fighter {
      public void m\u00EAl\u00E9eAttack() {}
    }
    

    means the same thing and works correctly even if you mess up encodings. It just looks really bad in your editor, which is why this is the considerably worse option.