xml xslt utf-8 saxon windows-1252

Saxon input encoding not recognized?


I get weird characters in the UTF-8 text output from the Saxon XSLT processor.

The input XML starts with the declaration

<?xml version="1.0" encoding="windows-1252"?>

It contains strings like the following (as shown in Notepad++, which reports Windows-1252 in the status bar at the bottom right):

“abc”

The transformation stylesheet contains

<xsl:output method="text" encoding="utf-8" />

but the output contains (as shown in Notepad++, which reports UTF-8 in the status bar at the bottom right)

�abc�

instead of the UTF-8 encoded

“abc”

Any idea what I missed?

P.S.: When I use Notepad++ to convert the XML input from Windows-1252 to UTF-8, the output is encoded correctly, and that is my workaround. However, I'd like to understand whether I missed something or whether some piece of software handles character sets incorrectly.


Solution

  • I suspect that although the input is labelled as windows-1252, it isn't actually encoded in Windows-1252.

    First, try to find out whether the problem is on input or on serialization. You can do that by using string-to-codepoints() within the XSLT code to see what actual codepoints are present in the parsed node tree.

    If it's an input problem, then that's the responsibility of the XML parser rather than Saxon itself, so it depends on which XML parser you are using.
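
The codepoint check suggested above could look roughly like the following diagnostic stylesheet. This is only a sketch: it assumes an XSLT 2.0 (or later) processor such as Saxon, and the catch-all `//text()` selection is an illustrative choice that you would narrow to the elements you care about. Because it writes the codepoints as plain decimal numbers, the result is pure ASCII and cannot itself be distorted by the output encoding.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="utf-8"/>

  <!-- Diagnostic only: print the codepoints of every non-blank text node,
       one text node per line, as space-separated decimal numbers. -->
  <xsl:template match="/">
    <xsl:for-each select="//text()[normalize-space()]">
      <xsl:value-of select="string-to-codepoints(.)" separator=" "/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

If the input was decoded correctly, the string “abc” should appear as 8220 97 98 99 8221, since the curly quotes are U+201C and U+201D. Any other values for the first and last codepoint (for example 147 and 148, the raw Windows-1252 byte values, or 65533, the replacement character) would point to a problem on the input side, that is, in the XML parser or in the file's actual encoding. Correct codepoints would instead point to how the output is serialized or written to disk.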