
xmllint encodes special chars


This is my file (UTF-8 encoded):

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <bar>Hello World äüö</bar>
</foo>

I would like to use xmllint to produce this result:

<bar>Hello World äüö</bar>

But every command I try prints the non-ASCII characters as numeric character references:

$ xmllint --xpath "//bar" file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>
$ xmllint --xpath "//bar" --encode utf-8 file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>
$ xmllint --xpath "//bar" --noenc file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>

Do you have any idea how to get the unencoded result? (I cannot install other tools such as xmlstarlet.)

$ xmllint --version
xmllint: using libxml version 20907
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma
$ locale
LANG=C.utf8
LC_CTYPE="C.utf8"
LC_NUMERIC="C.utf8"
LC_TIME="C.utf8"
LC_COLLATE="C.utf8"
LC_MONETARY="C.utf8"
LC_MESSAGES="C.utf8"
LC_PAPER="C.utf8"
LC_NAME="C.utf8"
LC_ADDRESS="C.utf8"
LC_TELEPHONE="C.utf8"
LC_MEASUREMENT="C.utf8"
LC_IDENTIFICATION="C.utf8"
LC_ALL=
$ cat /etc/*-release
Rocky Linux release 8.8 (Green Obsidian)

Solution

  • The best option seems to be the cat command of xmllint's internal shell.

    Given

    <?xml version="1.0" encoding="UTF-8"?>
    <A>
      <B>Hello World äüö  &#xE4;&#xFC;&#xF6;</B>
    </A>
    

    Sending cat <xpath expression> to the internal shell:

    printf "%s\n" "cat //B/text()" 'bye' |  xmllint --shell tmp.xml | grep -Ev '^([/]| -----)'
    Hello World äüö  äüö
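    The same pipeline can be driven from Python. This is a minimal sketch, not a definitive implementation: the helper names strip_shell_noise and xmllint_cat are made up for illustration, and the filter simply mirrors the grep -Ev pattern above, so a text line that itself starts with "/" would also be dropped.

```python
import subprocess


def strip_shell_noise(output: str) -> str:
    """Drop xmllint --shell prompt lines ("/ > ...") and " -----" separators.

    Mirrors: grep -Ev '^([/]| -----)'
    """
    kept = [
        line
        for line in output.splitlines()
        if not line.startswith("/") and not line.startswith(" -----")
    ]
    return "\n".join(kept)


def xmllint_cat(xml_path: str, xpath: str) -> str:
    """Send `cat <xpath>` and `bye` to xmllint's shell, return the node output."""
    commands = f"cat {xpath}\nbye\n"
    result = subprocess.run(
        ["xmllint", "--shell", xml_path],
        input=commands,
        capture_output=True,
        encoding="utf-8",  # decode the output as UTF-8 regardless of locale
        check=True,
    )
    return strip_shell_noise(result.stdout)
```

    With the sample file above, xmllint_cat("tmp.xml", "//B/text()") should return the same line as the shell pipeline, provided xmllint is on the PATH.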
    

    The issue looks related to the xmllint version (ultimately the libxml2 version): the question uses 20907, while the tests below use 20914.

    xmllint --version
    xmllint: using libxml version 20914
    

    Using xmllint --shell

    echo "cat //B" | xmllint --shell tmp.xml 
    / > cat //B
     -------
    <B>Hello World äüö  äüö</B>
    

    --noenc without --xpath: noenc takes precedence over noent, which makes sense; all characters in the output are ASCII.

    xmllint --noenc tmp.xml 
    <?xml version="1.0"?>
    <A>
      <B>Hello World &#xE4;&#xFC;&#xF6;  &#xE4;&#xFC;&#xF6;</B>
    </A>
    

    --noent (this looks like the default behavior):

    xmllint --noent tmp.xml 
    <?xml version="1.0" encoding="UTF-8"?>
    <A>
      <B>Hello World äüö  äüö</B>
    </A>
    

    --xpath: noenc is ignored.

    xmllint --noenc --noent --xpath '//B' tmp.xml 
    <B>Hello World äüö  äüö</B>
    
    xmllint --xpath '//B' tmp.xml 
    <B>Hello World äüö  äüö</B>
    

    --shell: noenc is ignored by the internal cat command and enforced by the xpath one.

    xmllint --shell --noenc tmp.xml 
    / > cat //B/text()
     -------
    Hello World äüö  äüö
    / > xpath //B/text()
    Object is a Node Set :
    Set contains 1 nodes:
    1  TEXT
        content=Hello World #C3#A4#C3#BC#C3#B6  #C3#A4#C3#BC#C3#B6
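    The #C3#A4 sequences in that dump are the raw UTF-8 bytes of the text, each non-ASCII byte printed as # followed by its hex value (ä is U+00E4, which UTF-8 encodes as C3 A4). The sketch below only mimics the format as it appears in the output above; debug_dump is a hypothetical name, not a libxml2 function.

```python
def debug_dump(text: str) -> str:
    """Render text the way the dump above looks: ASCII bytes are kept,
    every other UTF-8 byte becomes '#' plus two uppercase hex digits."""
    parts = []
    for byte in text.encode("utf-8"):
        parts.append(chr(byte) if byte < 0x80 else f"#{byte:02X}")
    return "".join(parts)


print(debug_dump("Hello World äüö"))
# prints: Hello World #C3#A4#C3#BC#C3#B6
```

    So the text node itself holds the real characters; only the display of the xpath command is byte-oriented.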
    

    ASCII encoding

    xmllint --encode ASCII tmp.xml
    <?xml version="1.0" encoding="ASCII"?>
    <A>
      <B>Hello World &#228;&#252;&#246;  &#228;&#252;&#246;</B>
    </A>
    

    The lxml Python module is also based on libxml2, so here is a one-liner that does the same:

    python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(doc.xpath(sys.argv[2]))' tmp.xml '//B/text()'
    

    Text result:

    ['Hello World äüö  äüö']
    

    Serializing without specifying an encoding:

    python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(etree.tostring(doc.xpath(sys.argv[2])[0]).decode("utf-8"))' tmp.xml '//B'
    <B>Hello World &#228;&#252;&#246;  &#228;&#252;&#246;</B>
    

    Serializing with an encoding:

    python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(etree.tostring(doc.xpath(sys.argv[2])[0], encoding="utf-8").decode("utf-8"))' tmp.xml '//B'
    <B>Hello World äüö  äüö</B>
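
    If neither the internal shell nor lxml is available, the numeric character references can also be decoded after the fact using only the Python standard library. This is a sketch under assumptions: decode_numeric_refs is a hypothetical helper, it works on plain text rather than parsing XML, and it does not treat CDATA sections or comments specially. References to <, >, &, " and ' are left escaped so the result stays well-formed.

```python
import re

# Characters that must stay escaped to keep the output well-formed XML.
_KEEP_ESCAPED = set("<>&\"'")

# Matches decimal (&#228;) and hexadecimal (&#xE4;) character references.
_NUMERIC_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(xml_text: str) -> str:
    """Replace numeric character references with the characters they name."""

    def replace(match: "re.Match[str]") -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            char = chr(code_point)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid code point: leave untouched
        return match.group(0) if char in _KEEP_ESCAPED else char

    return _NUMERIC_REF.sub(replace, xml_text)


print(decode_numeric_refs("<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>"))
# prints: <bar>Hello World äüö</bar>
```

    The input line in the example is the output of xmllint --xpath "//bar" file.xml from the question.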