This is my file (UTF-8 encoded):
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Hello World äüö</bar>
</foo>
I would like to use xmllint to produce this result:
<bar>Hello World äüö</bar>
But every command prints encoded unicode characters:
$ xmllint --xpath "//bar" file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>
$ xmllint --xpath "//bar" --encode utf-8 file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>
$ xmllint --xpath "//bar" --noenc file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>
Do you have any idea how to get the unencoded result? (I cannot install other tools such as xmlstarlet.)
$ xmllint --version
xmllint: using libxml version 20907
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma
$ locale
LANG=C.utf8
LC_CTYPE="C.utf8"
LC_NUMERIC="C.utf8"
LC_TIME="C.utf8"
LC_COLLATE="C.utf8"
LC_MONETARY="C.utf8"
LC_MESSAGES="C.utf8"
LC_PAPER="C.utf8"
LC_NAME="C.utf8"
LC_ADDRESS="C.utf8"
LC_TELEPHONE="C.utf8"
LC_MEASUREMENT="C.utf8"
LC_IDENTIFICATION="C.utf8"
LC_ALL=
$ cat /etc/*-release
Rocky Linux release 8.8 (Green Obsidian)
The best option seems to be the cat command of xmllint's internal shell.
Given
<?xml version="1.0" encoding="UTF-8"?>
<A>
<B>Hello World äüö äüö</B>
</A>
Sending cat <xpath expression> to the internal shell:
printf "%s\n" "cat //B/text()" 'bye' | xmllint --shell tmp.xml | grep -Ev '^([/]| -----)'
Hello World äüö äüö
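The grep at the end only strips the shell's prompt and separator lines. For anyone driving xmllint from a script, the same filtering could look roughly like the Python sketch below. The function name and the sample text are made up for illustration, and it assumes the noise lines start with "/" or " -----", exactly as in the grep pattern above.

```python
import re

# Same idea as grep -Ev '^([/]| -----)': drop prompt lines ("/ > ...")
# and the " -------" separator that the shell prints before each result.
NOISE = re.compile(r"^(/| -----)")

def strip_shell_noise(shell_output):
    """Return only the lines of xmllint --shell output that carry node content."""
    return [line for line in shell_output.splitlines() if not NOISE.match(line)]
```

Note that, like the grep, this would also drop a genuine result line that happens to start with "/".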
The issue looks related to the xmllint version (ultimately the libxml2 version). See the details below.
xmllint --version
xmllint: using libxml version 20914
Using xmllint --shell
echo "cat //B" | xmllint --shell tmp.xml
/ > cat //B
-------
<B>Hello World äüö äüö</B>
--noenc without --xpath: --noenc takes precedence over --noent, which makes sense; all characters in the output are ASCII.
xmllint --noenc tmp.xml
<?xml version="1.0"?>
<A>
<B>Hello World &#xE4;&#xFC;&#xF6; &#xE4;&#xFC;&#xF6;</B>
</A>
--noent (looks like the default)
xmllint --noent tmp.xml
<?xml version="1.0" encoding="UTF-8"?>
<A>
<B>Hello World äüö äüö</B>
</A>
--xpath: --noenc is ignored
xmllint --noenc --noent --xpath '//B' tmp.xml
<B>Hello World äüö äüö</B>
xmllint --xpath '//B' tmp.xml
<B>Hello World äüö äüö</B>
--shell: --noenc is ignored by the internal cat command and enforced by the xpath one.
xmllint --shell --noenc tmp.xml
/ > cat //B/text()
-------
Hello World äüö äüö
/ > xpath //B/text()
Object is a Node Set :
Set contains 1 nodes:
1 TEXT
content=Hello World #C3#A4#C3#BC#C3#B6 #C3#A4#C3#BC#C3#B6
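The #C3#A4 notation in the xpath dump is simply the UTF-8 byte sequence of each non-ASCII character, one #XX group per byte (ä is C3 A4, ü is C3 BC, ö is C3 B6). A small Python sketch that turns such a dump back into text; the helper name is made up, and it assumes the bytes are valid UTF-8:

```python
import re

# One or more consecutive "#XX" groups, each standing for one byte in hex.
HASH_BYTES = re.compile(r"(?:#[0-9A-Fa-f]{2})+")

def decode_hash_bytes(dump):
    """Replace runs of #XX byte groups with the text they encode in UTF-8."""
    def replace(match):
        raw = bytes.fromhex(match.group(0).replace("#", ""))
        # errors="replace" keeps the sketch from failing on a truncated sequence.
        return raw.decode("utf-8", errors="replace")
    return HASH_BYTES.sub(replace, dump)
```

Text that contains a literal "#" followed by two hex digits would be misread by this, so it is only meant for inspecting dumps like the one above.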
ASCII encoding
xmllint --encode ASCII tmp.xml
<?xml version="1.0" encoding="ASCII"?>
<A>
<B>Hello World &#228;&#252;&#246; &#228;&#252;&#246;</B>
</A>
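When the output encoding is ASCII, any character outside ASCII has to be written as a numeric character reference, either decimal (&#228;) or hexadecimal (&#xE4;). The Python sketch below shows both directions. It is not xmllint's own code, and the function names are made up for illustration.

```python
import re

# Matches hexadecimal (&#xE4;) or decimal (&#228;) character references.
CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def to_ascii_with_refs(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def decode_non_ascii_refs(text):
    """Turn numeric references back into characters, leaving ASCII ones alone.

    References to ASCII code points (for example &#38; for '&') stay escaped,
    so markup-significant characters are never introduced.
    """
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point < 128 or code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return CHAR_REF.sub(replace, text)
```

Applied as a filter to output that contains such references, decode_non_ascii_refs would give back the literal characters.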
The lxml Python module is also based on libxml2, so here is a one-liner that does the same:
python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(doc.xpath(sys.argv[2]))' tmp.xml '//B/text()'
text result
['Hello World äüö äüö']
Serializing without specifying an encoding:
python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(etree.tostring(doc.xpath(sys.argv[2])[0]).decode("utf-8"))' tmp.xml '//B'
<B>Hello World &#228;&#252;&#246; &#228;&#252;&#246;</B>
Serializing with an encoding:
python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(etree.tostring(doc.xpath(sys.argv[2])[0], encoding="utf-8").decode("utf-8"))' tmp.xml '//B'
<B>Hello World äüö äüö</B>
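If lxml is not installed either, Python's standard library module xml.etree.ElementTree might be enough. This is only a sketch with a made-up helper name, and ElementTree supports just a subset of XPath, so the path has to be written as .//B rather than //B or //B/text().

```python
import xml.etree.ElementTree as ET

def serialize_matches(xml_bytes, path):
    """Serialize every element matching path, keeping non-ASCII characters literal."""
    root = ET.fromstring(xml_bytes)
    # encoding="unicode" makes tostring() return a str instead of ASCII bytes
    # with character references; strip() drops the whitespace tail that
    # follows the element in the source document.
    return [
        ET.tostring(elem, encoding="unicode").strip()
        for elem in root.iterfind(path)
    ]
```

Reading tmp.xml in binary mode and calling serialize_matches(data, ".//B") should then return the B elements as strings.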