How do I separate xmlstarlet output with nul?


I'm trying to use nul (U+0000) to delimit XML values in xmlstarlet output, but xmlstarlet ignores -o '', -o $'\0', and -o '\0'.

I'm aware that I can use other characters like the various field separators to delimit output. The problem with this approach is that these characters can also exist as data. I don't want any ambiguity.

I want to use nul specifically because it's the only value that can't be represented in raw XML.

So, to repeat my question: How do I separate xmlstarlet output with nul?

More information

I've included the following information because commenters asked for it. While I appreciate your desire to help, please avoid suggesting XY solutions. I'm only looking for an answer to my question as presented.

The data I'm working with looks like this:

<data>
    <datapoint attribute-1="val-1" attribute-2="val-a" />
    <datapoint attribute-1="val-2" attribute-2="val-b"  />
    <datapoint attribute-1="val-3">
        <sub-datapoint />
    </datapoint>
</data>

The way I'm trying to use xmlstarlet:

mapfile -t -d '' ARRAY < <( xmlstarlet sel -t -m /data/datapoint -o 'datapoint' -o $'\0' -v ./@attribute-1 -o $'\0' data.xml )

A hexdump of the output I'm looking for:

64 61 74 61 70 6f 69 6e  74 00 76 61 6c 2d 31 00  |datapoint.val-1.|
64 61 74 61 70 6f 69 6e  74 00 76 61 6c 2d 32 00  |datapoint.val-2.|
64 61 74 61 70 6f 69 6e  74 00 76 61 6c 2d 33 00  |datapoint.val-3.|

Solution

  • Unfortunately, xmlstarlet doesn't seem to be capable of producing nul in its output.

    xmlstarlet is, however, capable of producing U+FFFF, a codepoint that's invalid in all XML versions. You can use this codepoint to safely delimit XML values, and then use another program to replace it with nul:

    xmlstarlet sel -t \
       -m /data/datapoint \
       -o 'datapoint' \
       -o $'\uffff' \
       -v ./@attribute-1 \
       -o $'\uffff' data.xml \
     | python3 -c 'import sys; sys.stdout.write(sys.stdin.read().replace("\uffff", "\0"))'