java, xml, jaxb, apache-camel, marshalling

XML content still ISO 8859-1 after UTF-8 JAXB Marshalling


I am using Camel to create a JAXB object, marshal it, and then write the result to a UTF-8 encoded XML file. Some of my XML content is fetched from a data source that uses ISO 8859-1 encoding.

Here is my Camel route:

import org.apache.camel.converter.jaxb.JaxbDataFormat;

JaxbDataFormat jaxbDataFormat = new JaxbDataFormat(Claz.class.getPackage().getName());

from("endpoint")
   .process(/* createObjectBySettingTheDataFromSource */)
   .marshal(jaxbDataFormat)
   .to("FILEENDPOINT?charset=utf-8&fileName=" + Filename);

The XML is generated successfully, but the content fetched from the source is still in ISO 8859-1 encoding and is not converted to UTF-8.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>     
     <Name>M��e Faࠥnder</Name> //Mürthe Faßender 

If I change the file encoding to ISO 8859-1, the content is displayed correctly.

I tried to convert the data before setting it on the JAXB object, but it is still not displayed correctly in UTF-8:

  byte[] nameBytes = name.getBytes(StandardCharsets.ISO_8859_1);
  return new String(nameBytes, StandardCharsets.UTF_8);

The problem only occurs under Linux. Does anyone have an idea how to handle the ISO_8859_1 data so that it ends up in the XML without issues?


Solution

  • Well, UTF-8 is the default charset (at least for the file endpoint) and AFAIK Camel does not try to analyze the given charset of an input message.

    So I guess that if you don't declare an input charset other than UTF-8 and then write the file as UTF-8, there is nothing to convert from Camel's perspective.

    from("file:inbox") // implicit UTF-8
    .to("file:outbox?charset=utf-8") // same charset, no conversion needed
    

    You can, at least for files, declare the source encoding so that Camel knows it must convert the payload.

    from("file:inbox?charset=iso-8859-1") // source encoding declared
    .to("file:outbox?charset=utf-8") // conversion needed
    

    If you cannot declare the input charset (I think this depends on the endpoint type), you have to explicitly convert the payload.

    from("file:inbox")
    .convertBodyTo(byte[].class, "utf-8")
    // message body is now a byte array and is written to the file as is
    .to("file:outbox")
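
Outside of Camel, an explicit conversion like this comes down to decoding the bytes with the charset they were written in and then encoding the resulting text as UTF-8. The following is only a minimal plain-JDK sketch of that idea. The class `Latin1ToUtf8` and its `transcode` method are hypothetical names used for illustration, not part of Camel or JAXB, and it assumes the raw ISO 8859-1 bytes from the source are still available.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Camel or JAXB.
final class Latin1ToUtf8 {

    // Decode the bytes with the charset they were written in,
    // then encode the resulting text as UTF-8.
    static byte[] transcode(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The name from the question, written with Unicode escapes so that
        // the encoding of this source file does not matter.
        String name = "M\u00FCrthe Fa\u00DFender";

        // Simulate what an ISO 8859-1 data source would deliver.
        byte[] source = name.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = transcode(source);

        System.out.println(source.length + " bytes in, " + utf8.length + " bytes out");
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(name));
    }
}
```

The UTF-8 output is longer than the input because each of the two non-ASCII characters needs two bytes in UTF-8 but only one in ISO 8859-1.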
    

    See the section "Using charset" from the Camel File docs for more details.
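
As a side note on the conversion attempted in the question: a Java `String` is already decoded text and carries no charset of its own, so encoding it with ISO 8859-1 and decoding the same bytes as UTF-8 does not "convert" anything. It damages every non-ASCII character instead. The sketch below reproduces this with the plain JDK. The class `WrongRoundTrip` and the method `attemptedFix` are hypothetical names, and it assumes the `name` value holds correctly decoded characters to begin with.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class WrongRoundTrip {

    // Mirrors the attempted fix from the question: encode with one charset,
    // decode the same bytes with a different one.
    static String attemptedFix(String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The name from the question, written with Unicode escapes so that
        // the encoding of this source file does not matter.
        String name = "M\u00FCrthe Fa\u00DFender";
        String broken = attemptedFix(name);

        // Whether the text survived the round trip unchanged.
        System.out.println(broken.equals(name));
        // Whether the result contains the Unicode replacement character.
        System.out.println(broken.indexOf('\uFFFD') >= 0);
    }
}
```

The single bytes that ISO 8859-1 uses for the two non-ASCII characters are not valid UTF-8 sequences on their own, so the decoder substitutes the replacement character for them. The charset has to be applied where the raw bytes are first turned into text, not afterwards on a `String`.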