Apache's FileHandler saves unicode escapes instead of special chars


I've been using Apache Commons Configuration 2 to manage properties files. The problem is that, when saving a configuration to a file, special characters are replaced with their Java Unicode escape sequences, e.g. a=Rumänien becomes a=Rum\u00E4nien.

Is there any way to avoid this? Preferably using FileHandlers or similar Writers/Streams, as I'm not able to use the Builder provided by Apache.

Reading works correctly: if I set a breakpoint I can see the correct value in memory. As soon as I persist the configuration, though, I get the escaped result.

Here's an MCVE. Note that you need to link the following Apache libraries:

  • Configuration2
  • Logging
  • Lang

    import org.apache.commons.configuration2.PropertiesConfiguration;
    import org.apache.commons.configuration2.io.FileHandler;

    // The class name is arbitrary
    public class Main {
        public static void main(final String[] args) {
            final String inputPath = "C:\\yourFullPath\\properties_in.cfg";
            final String outputPath = "C:\\yourFullPath\\properties_out.cfg";

            final PropertiesConfiguration config = new PropertiesConfiguration();

            try {
                // Load the config, reading the file as UTF-8
                final FileHandler inputHandler = new FileHandler(config);
                inputHandler.setEncoding("UTF-8");
                inputHandler.setPath(inputPath);

                inputHandler.load();

                // Save the same config to a different file, also as UTF-8
                final FileHandler outputHandler = new FileHandler(config);
                outputHandler.setEncoding("UTF-8");
                outputHandler.setPath(outputPath);

                outputHandler.save();
            } catch (final Exception e) {
                e.printStackTrace();
            }
        }
    }

The content of properties_in.cfg before and after running the code is a=Rumänien.

properties_out.cfg doesn't exist before running the code; afterwards its content is a=Rum\u00E4nien


Solution

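  • This happens because the classic definition of Java's properties files requires them to be in ISO-8859-1 encoding, with every character outside that charset written as a \uXXXX escape. Writers tend to be stricter still and escape everything outside printable ASCII, which is why ä is escaped here even though ISO-8859-1 contains it. So, technically speaking, everything works as specified.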

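    If the library allows it (perhaps through a custom writer), you could work around this by writing the file as UTF-8 without performing the escapes. Anything that reads the file afterwards then has to treat it as UTF-8 as well.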
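
To make the escaping rule concrete, here is a minimal sketch that uses only the JDK's java.util.Properties, not the Commons Configuration API. The class name PropertiesEscapeDemo and its two helper methods are made up for illustration. It contrasts the stream-based store, which writes ISO-8859-1 and escapes non-ASCII characters, with the writer-based store, which passes characters through to a UTF-8 writer untouched.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; it is not part of any library.
public class PropertiesEscapeDemo {

    // Stream-based store: output is ISO-8859-1 and non-ASCII characters
    // are written as Unicode escape sequences.
    static String storeViaStream(Properties props) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Writer-based store: characters reach the writer unescaped, so the
    // writer's charset (UTF-8 here) decides how they end up on disk.
    static String storeViaWriter(Properties props) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            props.store(writer, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // The escape below is simply the letter a-umlaut, spelled this way
        // so the source compiles regardless of the file's encoding.
        props.setProperty("a", "Rum\u00E4nien");

        System.out.println(storeViaStream(props)); // escaped form
        System.out.println(storeViaWriter(props)); // plain form
    }
}
```

Run as-is, the first call should produce the line a=Rum\u00E4nien and the second a=Rumänien, each preceded by the timestamp comment that store always writes. For Commons Configuration itself, the comparable hook is probably PropertiesConfiguration.setIOFactory(...) with a factory that returns a PropertiesWriter which skips the escaping; treat that as a pointer to investigate, not a tested recipe.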