Search code examples
xmlcharacter-encodingsh

Which XML encoding for storing filenames?


I would like to store raw filenames in an XML document but the encoding doesn't allow it.

Here's how you can generate a test.xml with the shell:

#!/bin/sh
cat <<EOF > test.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <music>🎵</music>
    <file><![CDATA[$(printf \\xf1.mp3)]]></file>
</root>
EOF

Now if I try to read it with any XML parser (for example python):

import xml.dom.minidom
xml.dom.minidom.parse('test.xml')

I get encoding problems:

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 19

Is there a way to make XML allow any byte (but NUL)?


Solution

  • Looks like an encoding issue tmp.xml:4: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xF1 0x2E 0x6D 0x70 <file><![CDATA[�.mp3]]></file> . So printf \xf1 creates a non utf8 character.

    Converting the ISO-8859-1 filenames to UTF-8

    cat <<EOF > tmp.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <music>🎵</music>
        <file><![CDATA[$(printf \\xf1.mp3 | iconv -f iso-8859-1 -t utf-8)]]></file>
    </root>
    EOF
    

    Result

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
        <music>🎵</music>
        <file><![CDATA[ñ.mp3]]></file>
    </root>
    

    If source enconding is unknown a workaround could be to store base64 strings.

    printf \\xf1.mp3 | base64
    8S5tcDM=
    

    Changing the encoding of the XML gives no error

    cat <<EOF > tmp.xml
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <root>
        <music>🎵</music>
        <file><![CDATA[$(printf \\xf1.mp3)]]></file>
    </root>
    EOF
    

    Testing with xmllint

    xmllint tmp.xml
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <root>
        <music>🎵</music>
        <file><![CDATA[�.mp3]]></file>
    </root>
    

    A question mark is shown since it's a utf-8 console printing ISO-8859-1 text.