javascript html dom shadow-dom treewalker

How to extract a website's HTML tags in DOM and shadow DOM


I'm trying to get the HTML structure of multiple websites using Node.js, and I'm having difficulties. I want just the HTML structure of each document, with no content, while preserving classes, IDs, and other attributes.

Example of what I want back:

<head>
  <title></title>
</head>
<body>
  <h1></h1>
  <div>
    <div class="something">
      <p></p>
    </div>
  </div>
</body>

Any suggestions on how to do this? Thanks.


Solution

  • Going by the tags on the question, why not use the TreeWalker API (available in all browsers since 2011)?

    You do not really want to extract HTML tags.

    You want to empty the text nodes and keep everything else:

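      function removeTextNodes( root = document.body ) {
        // Work on a deep copy so the live page keeps its text
        const copy = root.cloneNode(true);
        // Visit text nodes only and blank each one
        const tree = document.createTreeWalker(copy, NodeFilter.SHOW_TEXT);
        let node;
        while ((node = tree.nextNode())) node.textContent = "";
        return copy.outerHTML;
      }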
    
    

    If the page has open shadow roots, you also need to recurse into each of them, because outerHTML does not include shadow DOM content.
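
One way to do that recursion is to walk the tree by hand instead of relying on outerHTML. The following is a minimal sketch, not a definitive implementation: the function name `structureOf` and the `<#shadow-root>` marker are made up for illustration, and it assumes nodes expose the standard DOM properties `nodeType`, `localName`, `attributes`, `childNodes`, and `shadowRoot`. It does not special-case void elements such as `<br>`, and closed shadow roots stay invisible because `shadowRoot` is null for them.

```javascript
// Sketch: emit tags and attributes only, descending into open shadow roots.
// `structureOf` and the <#shadow-root> marker are hypothetical names.
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE in the browser

function structureOf(node, depth = 0) {
  // Text nodes, comments, and so on contribute nothing to the structure
  if (node.nodeType !== ELEMENT_NODE) return "";

  const pad = "  ".repeat(depth);
  const attrs = Array.from(node.attributes || [])
    .map(a => ` ${a.name}="${String(a.value).replace(/"/g, "&quot;")}"`)
    .join("");
  let out = `${pad}<${node.localName}${attrs}>\n`;

  // shadowRoot is null for closed roots, so only open ones are visited
  if (node.shadowRoot) {
    out += `${pad}  <#shadow-root>\n`;
    for (const child of node.shadowRoot.childNodes || []) {
      out += structureOf(child, depth + 2);
    }
    out += `${pad}  </#shadow-root>\n`;
  }

  // Regular (light DOM) children
  for (const child of node.childNodes || []) {
    out += structureOf(child, depth + 1);
  }
  return out + `${pad}</${node.localName}>\n`;
}
```

In a page context this would be called as `structureOf(document.documentElement)`. From Node.js the function has to run inside a real or emulated browser page, since shadow roots are usually created by the page's own scripts.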