xmlstarlet parser error : Entity '*' not defined


When I run xmlstarlet on web pages, I almost always hit an entity reference error, which makes it useless for extracting data from them.

HTML pages are not well-formed XML (is there an option to process HTML directly?), so I convert them with

tidy -asxhtml 

to XHTML, where tidy adds this declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Then I process the result with xmlstarlet:

curl http://www.xfree86.org/current/index.html |  tidy -asxhtml | \
  xmlstarlet sel --net -T   -t -m hr -v . -

It always fails with the same error:

-:13: parser error : Entity 'reg' not defined
<h1>Documentation for XFree86&reg; version 4.8.0</h1>

Does anybody know how to tell xmlstarlet where to find the entity definitions?


Solution

  • Try telling tidy to convert the character entities to numeric ones like this:

    curl --silent -q http://www.xfree86.org/current/index.html | \
    tidy -q -numeric -asxhtml --show-warnings no  | \
    xmlstarlet sel -N xhtml="http://www.w3.org/1999/xhtml" -t -m "//xhtml:hr" -c . -n 2>/dev/null
    

    Here, I added the following options:

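    • Tell curl to be silent with --silent (-q, short for --disable, only makes curl skip its curlrc file, and only when it is the first argument, so it is not needed for silence)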
    • Tell tidy to be quiet with -q and --show-warnings no
    • Tell tidy to convert entities to numeric ones with -numeric
    • Give xmlstarlet the xhtml namespace to use for the XPath with -N and name it xhtml
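    • Change the XPath to //xhtml:hr, because tidy -asxhtml puts every element in the XHTML namespace and an unprefixed hr matches nothing
    • Print a copy of each matched element with -c . followed by a newline (-n), since hr is an empty element and -v . would print nothing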

    This gets rid of the "entity not defined" error, silences the earlier commands in the pipeline, and selects the element you want.
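To see what converting entities to numeric ones means for the failing line, here is a minimal sketch that does the same kind of rewrite with sed. The helper name named_to_numeric is hypothetical and the entity list is deliberately partial; tidy -numeric covers the full set, so treat this only as an illustration.

```shell
#!/bin/sh
# named_to_numeric is a hypothetical helper (not part of tidy or xmlstarlet).
# It rewrites a few named HTML entities as numeric character references,
# which an XML parser accepts without needing the XHTML DTD.
# In a sed replacement, "&" means "the whole match", so it is escaped as "\&".
named_to_numeric() {
    sed -e 's/&reg;/\&#174;/g' \
        -e 's/&copy;/\&#169;/g' \
        -e 's/&nbsp;/\&#160;/g' \
        -e 's/&trade;/\&#8482;/g'
}

# The line from the error message, after conversion:
printf '%s\n' '<h1>Documentation for XFree86&reg; version 4.8.0</h1>' | named_to_numeric
# prints: <h1>Documentation for XFree86&#174; version 4.8.0</h1>
```

The numeric form &#174; needs no entity declaration, which is why the parser error goes away.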

    However, when I tried this with xmlstarlet v1.0.6, I still got this warning:

    Entity: line 1: parser warning : xmlParsePITarget: invalid name prefix 'xml'
    <?xmlstarlet version="1.0"?>
    

    I'm not sure whether this really matters, but it looks like a warning that is safe to ignore, so I just send stderr to /dev/null with 2>/dev/null. Keep in mind that this hides genuine errors too.
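If discarding all of stderr feels too blunt, one possible alternative is to capture it and show it only when the command fails. This is a rough sketch: quiet_unless_failed is a hypothetical helper, and it assumes the wrapped command exits with status 0 when it printed only warnings, which you should check for your xmlstarlet version.

```shell
#!/bin/sh
# quiet_unless_failed is a hypothetical helper, not part of xmlstarlet.
# It runs the given command with stderr captured in a temporary file and
# replays that stderr only if the command exits with a non-zero status.
quiet_unless_failed() {
    errlog=$(mktemp) || return 1
    status=0
    "$@" 2>"$errlog" || status=$?
    if [ "$status" -ne 0 ]; then
        cat "$errlog" >&2
    fi
    rm -f "$errlog"
    return "$status"
}

# Stand-in for a command that warns on stderr but still succeeds:
# only the stdout line is shown.
quiet_unless_failed sh -c 'echo "parser warning" >&2; echo "<hr />"'
```

In the pipeline above, the last stage would then read `quiet_unless_failed xmlstarlet sel ...` with the trailing 2>/dev/null removed; the function passes stdin through to the command unchanged.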