When I run xmlstarlet on web pages, I hit an entity reference error most of the time, which makes it useless for extracting data from them.
Since HTML pages are not well-formed XML (is there an option to process HTML directly?), I convert them with
tidy -asxhtml
to XHTML, where tidy adds the declaration
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Then, when I process the result with xmlstarlet
curl http://www.xfree86.org/current/index.html | tidy -asxhtml | \
xmlstarlet sel --net -T -t -m hr -v . -
it always throws the same error:
-:13: parser error : Entity 'reg' not defined
<h1>Documentation for XFree86&reg; version 4.8.0</h1>
Does anybody know how to tell xmlstarlet where to find the entity definitions?
Try telling tidy to convert the character entities to numeric ones like this:
curl --silent -q http://www.xfree86.org/current/index.html | \
tidy -q -numeric -asxhtml --show-warnings no | \
xmlstarlet sel -N xhtml="http://www.w3.org/1999/xhtml" -t -m "//xhtml:hr" -c . -n 2>/dev/null
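Here, I added the following options:
--silent and -q for curl: --silent hides the progress meter, and -q skips the curlrc file (though curl only honors -q when it is the first option)
-q and --show-warnings no for tidy, to suppress its informational output and warnings
-numeric for tidy, to write numeric character references instead of named entities
-N for xmlstarlet, to declare the XHTML namespace and name it xhtml
//xhtml:hr instead of hr, to match the hr element in the namespace xhtml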
This gets rid of the "entity not defined" error, silences the earlier commands in the pipeline, and selects the element you want.
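To see what the numeric conversion amounts to, here is a minimal sketch that does the same thing by hand for three entities. The numeric_entities helper is my own illustration, not part of tidy or xmlstarlet, and it only covers &reg;, &copy; and &nbsp;. Numeric references need no DTD, which is why the parser stops complaining.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a few named entities as numeric references.
# Only &reg;, &copy; and &nbsp; are handled here; this is an illustration,
# not a replacement for a real HTML cleaner.
numeric_entities() {
    # In the sed replacement, "&" must be escaped so it stays literal.
    sed -e 's/&reg;/\&#174;/g' \
        -e 's/&copy;/\&#169;/g' \
        -e 's/&nbsp;/\&#160;/g'
}

# The heading from the question, fed through the helper.
printf '%s\n' '<h1>Documentation for XFree86&reg; version 4.8.0</h1>' | numeric_entities
```

In the pipeline above, tidy's -numeric option takes care of this for every entity, so the helper is only useful for seeing the effect in isolation.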
However, when I tried this with xmlstarlet v1.0.6, I still got:
Entity: line 1: parser warning : xmlParsePITarget: invalid name prefix 'xml'
<?xmlstarlet version="1.0"?>
I'm not sure whether this really matters, but it looks like a warning that is safe to ignore, so I just send stderr to /dev/null with 2>/dev/null
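If throwing away all of stderr feels too blunt, something like the sketch below might drop only that warning and let other errors through. The names filter_pi_warning and noisy are hypothetical: noisy just stands in for the real pipeline so the example runs without network access. I am also assuming the warning always contains the text xmlParsePITarget and is followed by the <?xmlstarlet version line.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, keep its stdout untouched, and drop
# only the stderr lines that belong to the xmlParsePITarget warning.
# Note: the exit status of the wrapped command is not preserved.
filter_pi_warning() {
    # fd 3 holds the real stdout; stderr goes through the pipe to grep,
    # and grep writes whatever survives back to stderr.
    { "$@" 2>&1 1>&3 |
        { grep -v -F -e 'xmlParsePITarget' -e '<?xmlstarlet version' >&2 || true; }
    } 3>&1
}

# Stand-in for the curl | tidy | xmlstarlet pipeline (illustration only).
noisy() {
    echo '<hr/>'
    echo "Entity: line 1: parser warning : xmlParsePITarget: invalid name prefix 'xml'" >&2
    echo '<?xmlstarlet version="1.0"?>' >&2
    echo 'some other error' >&2
}

filter_pi_warning noisy
```

With the real command you would wrap only the xmlstarlet stage, for example by putting it in a small function and passing that function's name to filter_pi_warning.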