javascript html dom shadow-dom treewalker

How to extract a website's HTML tags from the DOM and shadow DOM

I'm trying to get the HTML structure of multiple websites using Node.js, and I'm having difficulties. I want just the HTML structure of the document, with no content, while preserving classes, IDs, and other attributes.

Example of what I want back:

<head>
  <title></title>
</head>
<body>
  <h1></h1>
  <div>
    <div class="something">
      <p></p>
    </div>
  </div>
</body>

Any suggestions on how to do this? Thanks.

Solution

Since the OP tagged the question javascript, dom, and shadow-dom, why not use the TreeWalker API (available in all browsers since 2011)?

You do not want to extract HTML tags; you want to empty the text nodes and keep everything else:

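  // Blank out every text node under root, then return the remaining markup.
  // Note: this modifies the live DOM in place.
  function removeTextNodes(root = document.body) {
    const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    let node;
    while ((node = walker.nextNode())) node.textContent = "";
    return root.outerHTML;
  }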

If you have open shadowRoots, you also need to recurse into each shadow DOM, because a TreeWalker does not cross shadow boundaries.
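
One way to sketch that recursion is shown here. Note that this swaps in a different technique from the TreeWalker snippet: since outerHTML does not serialize shadow trees, the sketch walks childNodes directly and builds the markup string itself. The helper name structureOf and its depth parameter are made up for illustration, it only sees open shadow roots (element.shadowRoot is null for closed ones), and it writes void elements such as img with a closing tag, so treat it as a starting point rather than a finished serializer.

```javascript
// Hypothetical helper: returns an indented tag-only outline of `node`,
// keeping attributes and descending into open shadow roots.
function structureOf(node, depth = 0) {
  const ELEMENT_NODE = 1;
  if (node.nodeType !== ELEMENT_NODE) return ""; // skip text, comments, etc.

  const pad = "  ".repeat(depth);
  const tag = node.localName; // lowercase for HTML elements

  // Keep every attribute, escaping the characters that would break quoting.
  const attrs = Array.from(node.attributes || [])
    .map((a) => {
      const value = a.value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
      return ` ${a.name}="${value}"`;
    })
    .join("");

  // Shadow children first (only reachable when the root is open),
  // then the element's regular light-DOM children.
  const children = [
    ...(node.shadowRoot ? Array.from(node.shadowRoot.childNodes) : []),
    ...Array.from(node.childNodes || []),
  ];
  const inner = children.map((c) => structureOf(c, depth + 1)).join("");

  return inner
    ? `${pad}<${tag}${attrs}>\n${inner}${pad}</${tag}>\n`
    : `${pad}<${tag}${attrs}></${tag}>\n`;
}
```

In a browser console, structureOf(document.documentElement) should return the outline of the whole page as a string. Unlike the snippet above, it only reads from the DOM and leaves the page untouched.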