xmlstarlet parser error : Entity '*' not defined

When I use xmlstarlet on web pages, I run into entity reference errors most of the time, which makes it useless for extracting data from them.

Since HTML pages are not well-formed XML (is there an option to process HTML directly?), I convert them with

tidy -asxhtml

to XHTML, where tidy adds the declaration

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Then, when I process the result with xmlstarlet

curl http://www.xfree86.org/current/index.html |  tidy -asxhtml | \
  xmlstarlet sel --net -T   -t -m hr -v . -

it always throws the same error

-:13: parser error : Entity 'reg' not defined
<h1>Documentation for XFree86&reg; version 4.8.0</h1>

Does anybody know how to point xmlstarlet at the entity reference file?

Solution

Try telling tidy to convert the character entities to numeric ones like this:

curl --silent -q http://www.xfree86.org/current/index.html | \
tidy -q -numeric -asxhtml --show-warnings no  | \
xmlstarlet sel -N xhtml="http://www.w3.org/1999/xhtml" -t -m "//xhtml:hr" -c . -n 2>/dev/null

Here, I added the following options:

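Tell curl to be silent with --silent (the -q flag is not a quiet flag: it tells curl to skip its .curlrc, and only takes effect as the first option)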
Tell tidy to be quiet with -q and --show-warnings no
Tell tidy to convert entities to numeric ones with -numeric
Give xmlstarlet the xhtml namespace to use for the XPath with -N and name it xhtml
Change the XPath to match an hr in the namespace xhtml

This gets rid of the entity-not-defined error, silences the earlier commands in the pipeline, and selects the element you want.
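
Why -numeric helps: a numeric character reference such as &#174; is part of XML itself and needs no DTD, whereas a named entity like &reg; has to be declared somewhere the parser can read. The sketch below uses sed purely as a stand-in to show that rewrite on the failing line from the question. It is not how tidy works internally, and the variable names are made up for illustration.

```shell
# The line xmlstarlet choked on, copied from the error message.
sample='<h1>Documentation for XFree86&reg; version 4.8.0</h1>'

# Replace the named entity with its numeric form (the registered sign is U+00AE, decimal 174).
# In the sed replacement, \& is a literal ampersand.
numeric=$(printf '%s\n' "$sample" | sed -e 's/&reg;/\&#174;/g')

printf '%s\n' "$numeric"
```

In practice tidy -numeric does this for every named entity, so there is no need to list them by hand.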

However, when I tried this with xmlstarlet v1.0.6, I still got this:

Entity: line 1: parser warning : xmlParsePITarget: invalid name prefix 'xml'
<?xmlstarlet version="1.0"?>

I'm not sure whether this really matters, but it looks like a warning that's safe to ignore, so I just redirect stderr to /dev/null with 2>/dev/null.
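
Sending all of stderr to /dev/null also hides real errors. A possible alternative, sketched here under assumptions, is to filter out only the two warning lines shown above and let everything else through. The functions filter_warning and fake_xmlstarlet are hypothetical names. fake_xmlstarlet just imitates the output so the example runs without the real tool.

```shell
# Hypothetical stand-in for the xmlstarlet call: one result line on stdout,
# the known warning plus an unrelated error on stderr.
fake_xmlstarlet() {
  echo '<hr />'
  echo "Entity: line 1: parser warning : xmlParsePITarget: invalid name prefix 'xml'" >&2
  echo '<?xmlstarlet version="1.0"?>' >&2
  echo 'some other error' >&2
}

# Run a command with its stderr piped through grep, while its stdout
# bypasses the pipe via file descriptor 3. grep -F matches fixed strings.
# Note: the exit status is grep's, not the wrapped command's.
filter_warning() {
  { "$@" 2>&1 1>&3 | grep -v -F -e 'xmlParsePITarget' -e '<?xmlstarlet version' >&2; } 3>&1
}

# Usage: the result goes to stdout, only the unrelated error stays on stderr.
filter_warning fake_xmlstarlet
```

With the real pipeline you would put the xmlstarlet command in place of fake_xmlstarlet. If your version prints extra context lines with the warning, the patterns would need to be extended.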