
Using xmlstarlet to extract HTML


I am trying to extract a particular section of an HTML document from a bash shell script. I have been using xmlstarlet sel, but I can't get it to return the actual HTML; it returns only the text values from inside the HTML tags.

I'm trying a command line as follows:

xmlstarlet sel -t -m "//div[@id='mw-content-text']" -v "." wiki.html

But it returns text only, without any HTML/XML markup. For context, I'm trying to export this data as HTML outside the MediaWiki instance it came from.

If xmlstarlet is the wrong tool, any suggestions for other tools also gratefully received!


Solution

  • -v is short for --value-of, which outputs only the text content of the matched node. Use -c (--copy-of) instead to copy the node itself, markup included.

    xmlstarlet sel -t -m "//div[@id='mw-content-text']" -c "." wiki.html
    

    Or just

    xmlstarlet sel -t -c "//div[@id='mw-content-text']" wiki.html
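
To see the difference between the two options side by side, here is a minimal sketch. It assumes xmlstarlet is installed under that name, and it uses a tiny hypothetical sample.html (well-formed, no XML namespace) in place of the real wiki.html:

```shell
#!/bin/sh
# Work in a throwaway directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)

# Hypothetical stand-in for wiki.html: well-formed markup without a namespace.
cat > "$workdir/sample.html" <<'EOF'
<html><body><div id="mw-content-text"><p>Hello <b>world</b></p></div></body></html>
EOF

# -v (--value-of): only the text content of the matched div.
text_only=$(xmlstarlet sel -t -v "//div[@id='mw-content-text']" "$workdir/sample.html")

# -c (--copy-of): the matched div itself, including its child elements.
with_markup=$(xmlstarlet sel -t -c "//div[@id='mw-content-text']" "$workdir/sample.html")

printf 'value-of: %s\n' "$text_only"
printf 'copy-of:  %s\n' "$with_markup"

rm -rf "$workdir"
```

Real-world pages are often not well-formed XML. If xmlstarlet sel fails to parse the file, one option that may help is to normalize it first with xmlstarlet fo -H -R (parse as HTML, try to recover from errors) and pipe the result into sel.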