.net, powershell, dom, mhtml

DOM Traversal in .mht saved webpage


Is it possible to traverse the DOM of a webpage saved as .mht, or saved as .htm (HTML only)?
Preferably in PowerShell or .NET.
The goal is to be able to do something like getElementsByTagName('div').
If so, how?


Solution

  • Found a solution using HtmlAgilityPack.
    Documentation can be found on NuDoq, which was mentioned in this post.

    Example code:

    # Choose a source: a local .mht file or a URL (keep only one of these lines active)
    $Source = 'C:\temp\myFile.mht'
    # $Source = 'http://www.google.com'
    
    # Get online or mht content
    $IE = New-Object -ComObject InternetExplorer.Application
    
    # Don't show the browser
    $IE.Visible = $false
    
    # Browse to your webpage/file
    $IE.Navigate($Source)
    
    # Wait for the page to load (ReadyState 4 = READYSTATE_COMPLETE)
    while ($IE.Busy -or $IE.ReadyState -ne 4) { Start-Sleep -Milliseconds 50 }
    
    # Get the html from that page
    $Html = $IE.Document.body.parentElement.outerHTML
    
    # Decode to get rid of HTML-encoded characters like &amp; etc.
    # System.Web is not loaded by default in Windows PowerShell
    Add-Type -AssemblyName System.Web
    $Html = [System.Web.HttpUtility]::HtmlDecode($Html)
    
    # Close the browser
    $IE.Quit()
    
    
    # Use HtmlAgilityPack (must be installed first)
    Add-Type -Path (Join-Path $Env:userprofile '.nuget\packages\htmlagilitypack\1.4.9.5\lib\Net40\HtmlAgilityPack.dll')
    $Hap = New-Object HtmlAgilityPack.HtmlDocument
    
    # Load the HTML into HtmlAgilityPack to get a DOM
    # ($Html rather than $global:Html, so this also works when run as a script)
    $Hap.LoadHtml($Html)
    
    # Retrieve the data from the DOM (read a node)
    [string]$partData = $Hap.DocumentNode.SelectSingleNode("//div[@class='formatted_content']/ul").InnerText
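
The example above reads a single node. For the getElementsByTagName('div') case from the question, a minimal sketch along the same lines might use SelectNodes with the XPath '//div'. It assumes HtmlAgilityPack is installed at the same path as above. The $SampleHtml string is hypothetical markup standing in for the $Html captured from the page.

```powershell
# Assumption: HtmlAgilityPack is installed at the same path used in the answer.
Add-Type -Path (Join-Path $Env:USERPROFILE '.nuget\packages\htmlagilitypack\1.4.9.5\lib\Net40\HtmlAgilityPack.dll')

# Hypothetical sample markup, standing in for the $Html retrieved from the page.
$SampleHtml = '<html><body><div class="a">one</div><p>skip</p><div class="b">two</div></body></html>'

$Doc = New-Object HtmlAgilityPack.HtmlDocument
$Doc.LoadHtml($SampleHtml)

# '//div' matches every <div> anywhere in the document,
# which is roughly what getElementsByTagName('div') returns in a browser.
$Nodes = $Doc.DocumentNode.SelectNodes('//div')

# SelectNodes returns $null when nothing matches, so check before iterating.
if ($null -ne $Nodes) {
    foreach ($Div in $Nodes) {
        # Emit the class attribute (empty string if absent) and the text content.
        '{0}: {1}' -f $Div.GetAttributeValue('class', ''), $Div.InnerText.Trim()
    }
}
```

Replacing 'div' in the XPath with another tag name gives the same kind of lookup for that tag. For a page saved as plain .htm, it should also be possible to skip the browser step and call $Doc.Load() with the file path instead of LoadHtml().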