I am trying to extract a particular section of an HTML document from a bash shell script and have been using xmlstarlet sel,
but I can't quite get it to return the actual HTML rather than just the text values of the HTML tags.
I'm trying a command line as follows:
xmlstarlet sel -t -m "//div[@id='mw-content-text']" -v "." wiki.html
But it is giving text only, without any HTML/XML markup. For info, I'm trying to export this data into an HTML format outside the MediaWiki instance it came from.
If xmlstarlet is the wrong tool, any suggestions for other tools also gratefully received!
-v means --value-of, which outputs only the text content of the matched nodes. You should use -c or --copy-of to get the tags themselves.
xmlstarlet sel -t -m "//div[@id='mw-content-text']" -c "." wiki.html
Or just
xmlstarlet sel -t -c "//div[@id='mw-content-text']" wiki.html
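As a quick illustration of the difference, suppose a minimal file sample.html (the file name and contents here are just assumptions for the example) contains:

<div id="mw-content-text"><p>Hello <b>world</b></p></div>

Then

xmlstarlet sel -t -v "//div[@id='mw-content-text']" sample.html

prints only the text, Hello world, while

xmlstarlet sel -t -c "//div[@id='mw-content-text']" sample.html

copies the matched node with its markup: <div id="mw-content-text"><p>Hello <b>world</b></p></div>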