I'm trying to get the HTML structure of multiple websites using Node.js, and I'm having difficulties. I want just the HTML structure of each document, with no content, while preserving classes, IDs, and other attributes.
Example of what I want back:
<head>
<title></title>
</head>
<body>
<h1></h1>
<div>
<div class="something">
<p></p>
</div>
</div>
</body>
Any suggestions on how to do this? Thanks.
Going by the tags on the OP's question, why not use the TreeWalker API (available in all browsers since 2011)?
You do not want to extract HTML tags.
You want to remove the text nodes:
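function removeTextNodes(root = document.body) {
  // Walk only the text nodes under root
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  let node;
  // nextNode() returns null once every text node has been visited
  while ((node = walker.nextNode())) node.textContent = "";
  // The remaining markup keeps all tags, classes, IDs, and other attributes
  return root.outerHTML;
}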
If you have open shadow roots, you need to recurse into each shadow DOM as well.
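A minimal sketch of that recursion, under some assumptions: the name removeTextNodesDeep is hypothetical (it is not part of the answer's code), and it swaps the TreeWalker for plain recursion over childNodes, because a TreeWalker does not step into shadow roots on its own. It only relies on the standard nodeType, childNodes, textContent, and shadowRoot properties.

```javascript
// Hypothetical helper, not from the original answer: blank every text node
// under `root`, descending into open shadow roots as well.
function removeTextNodesDeep(root) {
  const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM
  for (const child of root.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      child.textContent = ""; // keep the node, drop its content
    } else {
      removeTextNodesDeep(child); // elements, comments, fragments
    }
  }
  // shadowRoot is null for closed shadow roots and undefined on non-elements,
  // so only open shadow roots are visited here.
  if (root.shadowRoot) removeTextNodesDeep(root.shadowRoot);
  return root;
}
```

Note that outerHTML does not serialize shadow root contents, so after calling removeTextNodesDeep(document.body) you would still need to read each host's shadowRoot.innerHTML separately if you want that markup in the output.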