
How to use xmlstarlet to extract HTML data by id


I have a simple task that has me pulling my hair out; I'm sure I'm very close.

Here is my XHTML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<title>Test Page</title>
</head>

<body>

<p>
test
</p>

<table id="test_table">
<tr><td>test</td><td>test</td></tr>
<tr><th>mo test</th></tr>
</table>

</body>

</html>

... and xmlstarlet parses it without complaint:

$ xmlstarlet.exe el -v test.xhtml
html[@xmlns='http://www.w3.org/1999/xhtml']
html/head
html/head/title
html/body
html/body/p
html/body/table[@id='test_table']
html/body/table/tr
html/body/table/tr/td
html/body/table/tr/td
html/body/table/tr
html/body/table/tr/th

What I need to do is extract the data in the table element, preferably without the HTML markup. The context is a test suite in which a web page is requested and then written to a file. The test must validate the table data but still pass if other things on the page change. I also won't know in advance how many rows or columns the table will have; that varies with the data.

But when I try:

$ xmlstarlet.exe sel -t -c "/html/body/table[@id='test_table']" test.xhtml
Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
None of the XPaths matched; to match a node in the default namespace
use '_' as the prefix (see section 5.1 in the manual).
For instance, use /_:node instead of /node

Different tests need different ids, but every id value is unique. So, given any id in the XHTML, I need its data.

Thanks in advance.


Solution

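  • The XHTML document declares a default namespace (xmlns="http://www.w3.org/1999/xhtml"), so every element name in the XPath has to be namespace-qualified. One way is to bind a prefix to that namespace with -N in the xmlstarlet command: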

    xmlstarlet sel \
        -N n="http://www.w3.org/1999/xhtml" \
        -t \
        -c "/n:html/n:body/n:table[@id='test_table']/descendant::*/text()" \
    htmlfile 2>/dev/null
    

    Once the <table> element is located, descendant::*/text() selects the text nodes of all its descendant elements, and 2>/dev/null discards the warning:

    Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
    

    It yields:

    testtestmo test
    

    UPDATE: I didn't know this, but as the error message itself points out, xmlstarlet already binds the prefix _ to the default namespace of the document's root element, so there is no need to declare it with -N. This works as well:

    xmlstarlet sel \
        -t \
        -c "/_:html/_:body/_:table[@id='test_table']/descendant::*/text()" \
    htmlfile 2>/dev/null
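
Since the question asks for the data of any id, with an unknown number of rows and columns, the same idea can be wrapped in a small shell function. This is only a sketch: the name table_text_by_id is made up for illustration, and it assumes the id value contains no single quote, because the value is pasted straight into the XPath. It uses //* (which matches elements in any namespace, so neither -N nor the _: prefix is needed) and prints one text node per line instead of concatenating them.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): print every
# non-blank text node under the element with the given id, one per line.
# Assumption: the id contains no single quote, since it is interpolated
# directly into the XPath expression.
table_text_by_id() {
    id=$1
    file=$2
    # //* matches elements regardless of namespace.
    # [normalize-space()] skips whitespace-only text nodes between tags.
    # -v "normalize-space(.)" trims each value; -n ends each with a newline.
    # 2>/dev/null hides the DTD warning, but also any other error message.
    xmlstarlet sel -t \
        -m "//*[@id='$id']//text()[normalize-space()]" \
        -v "normalize-space(.)" -n \
        "$file" 2>/dev/null
}
```

For the test.xhtml from the question, a call such as

    table_text_by_id test_table test.xhtml

should print test, test and mo test on three separate lines. That output can then be compared against an expected file (for example with diff) without the rest of the page affecting the result.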