I would like to store raw filenames in an XML document but the encoding doesn't allow it.
Here's how you can generate a test.xml with the shell:
#!/bin/sh
cat <<EOF > test.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
<music>🎵</music>
<file><![CDATA[$(printf \\xf1.mp3)]]></file>
</root>
EOF
Now if I try to read it with any XML parser (for example, Python's):
import xml.dom.minidom
xml.dom.minidom.parse('test.xml')
I get encoding problems:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 19
Is there a way to make XML allow any byte (except NUL)?
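This looks like an encoding issue. xmllint reports:
tmp.xml:4: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xF1 0x2E 0x6D 0x70
<file><![CDATA[�.mp3]]></file>
So printf \\xf1 emits the single byte 0xF1, which is not valid UTF-8 here: in UTF-8, 0xF1 starts a four-byte sequence and must be followed by continuation bytes, not by "." (0x2E).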
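As a quick check of that diagnosis, here is a minimal Python sketch. The variable names are illustrative, and treating the byte as ISO-8859-1 is an assumption about where the filename came from.

```python
# The bytes produced by: printf '\xf1.mp3'
raw = b"\xf1.mp3"

# Strict UTF-8 decoding rejects the lone 0xF1 lead byte.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(is_valid_utf8)             # False
# Assumption: the name was created under an ISO-8859-1 locale.
print(raw.decode("iso-8859-1"))  # ñ.mp3
```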
If the filenames are ISO-8859-1, converting them to UTF-8 with iconv fixes the problem:
cat <<EOF > tmp.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
<music>🎵</music>
<file><![CDATA[$(printf \\xf1.mp3 | iconv -f iso-8859-1 -t utf-8)]]></file>
</root>
EOF
Result
<?xml version="1.0" encoding="UTF-8"?>
<root>
<music>🎵</music>
<file><![CDATA[ñ.mp3]]></file>
</root>
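The same conversion can be sketched in Python, which also lets you confirm that the result parses. This assumes the filename really is ISO-8859-1. It also assumes the name does not contain the sequence "]]>", which would end the CDATA section early.

```python
import xml.dom.minidom

raw_name = b"\xf1.mp3"  # filename bytes, assumed to be ISO-8859-1

# Equivalent of: iconv -f iso-8859-1 -t utf-8
utf8_name = raw_name.decode("iso-8859-1").encode("utf-8")

document = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<root>\n"
    b"<file><![CDATA[" + utf8_name + b"]]></file>\n"
    b"</root>\n"
)

# Parses without an ExpatError now that the content is valid UTF-8.
dom = xml.dom.minidom.parseString(document)
file_text = dom.getElementsByTagName("file")[0].firstChild.data
print(file_text)  # ñ.mp3
```

To get the original bytes back, encode the parsed text with the same assumed charset: file_text.encode("iso-8859-1").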
If the source encoding is unknown, a workaround could be to store the filenames as base64 strings:
printf \\xf1.mp3 | base64
8S5tcDM=
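A minimal Python sketch of the base64 round trip follows. The encoding="base64" attribute is a hypothetical marker invented for this example, not part of any XML standard. Readers of the file have to know to decode the element themselves.

```python
import base64
import xml.dom.minidom

raw_name = b"\xf1.mp3"  # arbitrary filename bytes, encoding unknown

# base64 output is plain ASCII, so it is safe in any XML encoding.
encoded = base64.b64encode(raw_name).decode("ascii")

# The encoding="base64" attribute is a made-up convention for this sketch.
document = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<root><file encoding="base64">' + encoded + "</file></root>"
).encode("utf-8")

dom = xml.dom.minidom.parseString(document)
stored = dom.getElementsByTagName("file")[0].firstChild.data

# Decoding returns the exact original bytes.
decoded = base64.b64decode(stored)
print(encoded)              # 8S5tcDM=
print(decoded == raw_name)  # True
```

This preserves every byte, at the cost of the filenames no longer being readable in the XML itself.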
Alternatively, declaring the XML encoding as ISO-8859-1 gives no error:
cat <<EOF > tmp.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
<music>🎵</music>
<file><![CDATA[$(printf \\xf1.mp3)]]></file>
</root>
EOF
Testing with xmllint
xmllint tmp.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
<music>🎵</music>
<file><![CDATA[�.mp3]]></file>
</root>
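The replacement character (�) is shown because a UTF-8 console is printing ISO-8859-1 text. Note that with this declaration the parser also reads the UTF-8 bytes of 🎵 as four separate ISO-8859-1 characters, so this approach only suits files that are ISO-8859-1 throughout.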
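To see what a parser makes of the ISO-8859-1 declaration, here is a hedged Python sketch using xml.dom.minidom. It builds the same document in memory rather than reading tmp.xml. The 🎵 is written as UTF-8 bytes, as a UTF-8 shell would produce.

```python
import xml.dom.minidom

document = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b"<root>\n"
    b"<music>" + "🎵".encode("utf-8") + b"</music>\n"
    b"<file><![CDATA[\xf1.mp3]]></file>\n"
    b"</root>\n"
)

dom = xml.dom.minidom.parseString(document)
file_text = dom.getElementsByTagName("file")[0].firstChild.data
music_text = dom.getElementsByTagName("music")[0].firstChild.data

print(file_text)        # ñ.mp3
print(len(music_text))  # 4: one character per UTF-8 byte, not one emoji

# The emoji is only recoverable by undoing the wrong decoding by hand.
recovered = music_text.encode("iso-8859-1").decode("utf-8")
print(recovered)        # 🎵
```

Every byte value maps to some ISO-8859-1 character, which is why the filename byte survives, but any UTF-8 text in the same file is misread.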