Printing the encoding attribute in XML files with XSL


I have a long list of XML files which may have different encodings. I would like to go through all the files and print their encodings. Printing the encoding attribute in the XML header is just a first step. (The next step, once I find out how to access the encoding attribute, would be to use it to test whether this is the expected encoding.)

This is what the input XML files may look like:

<?xml version="1.0" encoding="iso-8859-1"?>
<Resource Name="text1" Language="de">
    <Text>
    </Text>
</Resource>


<?xml version="1.0" encoding="utf-8"?>
<Resource Name="file2" Language="ko">
    <Text>
    </Text>
</Resource>

Here is the XSL, cut down to a minimum, but it still does not work. I think I fail to match the XML header by writing it this way. But how can I match something in the XML header?

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html"/>

    <xsl:template match="/">
     <html>
        <body>   
            <xsl:value-of select="@encoding"/>
        </body>
     </html>
    </xsl:template>
</xsl:stylesheet>

Solution

  • The encoding pseudo-attribute of the XML prolog is no longer relevant once the XML has been read by an XML-capable processor, unless the encoding in the prolog does not match the encoding actually used and the file contains characters that cannot be represented in that encoding.

    The only way I know of to use XSLT to get the encoding is to use the functions unparsed-text (XSLT 2.0) or unparsed-text-lines (XSLT 3.0) and then use regular expressions (replace or xsl:analyze-string, both XSLT 2.0) to parse the prolog by hand.
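
    The approach above could be sketched roughly as follows in XSLT 2.0. This is only a minimal sketch under stated assumptions: the parameter name file, its default value input.xml and the template name main are hypothetical, the file is assumed to start directly with the XML declaration (no byte order mark), and it is assumed to use an ASCII-compatible encoding so that reading it as iso-8859-1 keeps the declaration readable.

    ```xml
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="xs">
      <xsl:output method="text"/>

      <!-- Hypothetical parameter: URI of the file to inspect.
           A relative URI is resolved against the stylesheet's base URI. -->
      <xsl:param name="file" as="xs:string" select="'input.xml'"/>

      <xsl:template name="main">
        <!-- Read the file as plain text instead of parsing it as XML.
             Assumption: iso-8859-1 is requested so that any byte sequence
             decodes and the ASCII characters of the declaration stay intact. -->
        <xsl:variable name="text" select="unparsed-text($file, 'iso-8859-1')"/>

        <!-- Match the XML declaration at the very start of the text and
             capture the value of the encoding pseudo-attribute. -->
        <xsl:analyze-string select="$text"
            regex="^&lt;\?xml[^&gt;]*?encoding\s*=\s*[&quot;']([^&quot;']+)[&quot;']">
          <xsl:matching-substring>
            <xsl:value-of select="regex-group(1)"/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:template>
    </xsl:stylesheet>
    ```

    With a processor that supports named initial templates (Saxon, for example, accepts -it:main), this should print the declared encoding, or nothing if the file has no encoding pseudo-attribute. A processor may give other information about the encoding precedence over the second argument of unparsed-text, so the result is worth checking against known files first.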

    Since XSLT (and most XML-capable tools and processors) sees XML not as a text file but as a set of nodes containing streams of characters, not streams of bytes, reading the encoding is hardly ever necessary.

    If you want to know the encoding for functions like document, doc or unparsed-text, those functions are defined such that they will read the encoding from the prolog and use that. In XSLT 3.0 you can use try/catch to find out whether or not parsing a file succeeded. In XSLT 2.0 you have doc-available, which will return false if the file cannot be parsed, for example because the encoding does not match the bytes used.
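
    The doc-available check mentioned above might look like the sketch below. As before, the parameter name file, its default value and the template name main are hypothetical. Note that doc-available also returns false for other failures, such as a missing file or markup that is not well-formed, so a false result does not by itself prove an encoding mismatch.

    ```xml
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="xs">
      <xsl:output method="text"/>

      <!-- Hypothetical parameter: URI of the file to check -->
      <xsl:param name="file" as="xs:string" select="'input.xml'"/>

      <xsl:template name="main">
        <xsl:choose>
          <!-- True only if doc($file) would return a document node -->
          <xsl:when test="doc-available($file)">
            <xsl:value-of select="concat($file, ': parsed successfully')"/>
          </xsl:when>
          <!-- Missing file, malformed XML, or bytes that are invalid
               in the declared encoding all end up here -->
          <xsl:otherwise>
            <xsl:value-of select="concat($file, ': could not be parsed')"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:template>
    </xsl:stylesheet>
    ```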