I want to encode and decode binary data within an XML file (with Python, but whatever). I have to face the fact that an XML tag content has illegal characters. The only allowed ones are described in XML specs:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Which means that the unallowed are:
1 byte can encode 256 possibles. With these restrictions the first byte is limited to 256-29-8-1-3 = 215 possiblities.
Of that first bytes's 215 possibilites, base64 only uses 64 possibilites. Base64 generates 33% overhead (6 bits becomes 1 byte once encoded with base64).
So my question is simple: Is there an algorithm more efficient than base64 to encode binary data within XML? If not, where should we start to create it? (libraries, etc.)
NB: You wouldn't answer this post by "You shouldn't use XML to encode binary data because...". Just don't. You could at best argue why not to use the 215 possibilities for bad XML parser's support.
NB2: I'm not speaking about the second byte but there are certainly some considerations that wa can develop regarding the number of posibilities and the fact it should start by 10xxxxxx to respect UTF8 standard when we use the supplementary Unicode planes (what if not?).
I have developed the concept in a C code.
The project is on GitHub and is finally called BaseXML: https://github.com/kriswebdev/BaseXML
It has a 20% overhead, which is good for a binary safe version.
I had a hard time making it work with Expat, which is the behind the scene XML parser of Python (THAT DOESN'T SUPPORT XML1.1!). So you'll find the BaseXML1.0 Binary safe version for XML1.0.
I will maybe release the "for XML1.1" version later if requested (it is also binary safe and have a 14.7% overhead), it's ready and working indeed but useless with Python built-in XML parsers so I don't want to confuse people with too many versions (yet).