I feel that I'm missing something subtle here.
I have a $doc which, as I can see with $doc asText, really contains the content of the page to be parsed. It came from dom parse -html5 $body.
From here, I'd like to explore the DOM interactively, for example to get a list of anchors. It seems like $doc selectNodes {//a} should work, but it doesn't return anything. Neither does anything else I try with selectNodes (/head, /body, /html ... nothing!). I can see that there are childNodes, so the structure seems to be intact.
What is a better way to explore these nodes so I can figure out what is going wrong?
You can simplify your life this time: you are evidently working with HTML (not XML, or XHTML for that matter), because you pass -html5 to dom parse and you select HTML elements (anchors).
With -html5, tDOM places the parsed nodes in XML namespaces, which is why unprefixed XPath names such as //a match nothing. For plain HTML, namespaces carry no useful meaning, so you may ignore them. Use the -ignorexmlns flag to dom parse.
% package req tdom
0.9.2
% set someHTML {<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title></head><body>
<svg width="100" height="100">
<circle cx="50" cy="50" r="40" stroke="green" stroke-width="4" fill="yellow" />
</svg>
</body>
</html>}
% set doc [dom parse -html5 -ignorexmlns $someHTML]
This way, you can run your XPath queries and expressions without namespace awareness:
$doc selectNodes {//svg}
Note that this is a recommended use of tDOM, as its documentation puts it:
Since this probably isn't wanted by a lot of users and adds only burden for no good in a lot of use cases -html5 can be combined with -ignorexmlns, in which case all nodes and attributes in the DOM tree are not in an XML namespace.
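If you would rather keep the namespaces, or simply want to see what the parser produced, the following is a minimal sketch of how one might inspect the tree and run a namespace-aware query. It assumes a tDOM build with HTML5 support; the sample HTML and the prefix h are made up for illustration, and the namespace URI is read from the root element rather than hard-coded.

```tcl
package require tdom

# Hypothetical sample input; any HTML string would do here.
set html {<!DOCTYPE html>
<html><head><title>t</title></head>
<body><a href="/one">One</a> <a href="/two">Two</a></body></html>}

# Parse WITHOUT -ignorexmlns, so the nodes keep their namespaces.
set doc  [dom parse -html5 $html]
set root [$doc documentElement]

# Print the name and namespace URI of the root and of its children.
# A non-empty namespace URI is the reason an unprefixed query such
# as {//a} returns nothing.
puts "[$root nodeName] -> '[$root namespaceURI]'"
foreach child [$root childNodes] {
    puts "  [$child nodeName] -> '[$child namespaceURI]'"
}

# Namespace-aware query: bind an arbitrary prefix ("h") to the
# namespace reported by the root element, then use it in the XPath.
set ns      [$root namespaceURI]
set anchors [$doc selectNodes -namespaces [list h $ns] {//h:a}]
foreach a $anchors {
    puts "[$a getAttribute href] : [$a text]"
}
```

Printing nodeName and namespaceURI while walking childNodes is usually enough to tell whether a query fails because of the structure or because of namespaces. When you are done with the tree, $doc delete frees it.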