powershell, web-scraping, html-agility-pack

Web scraping from a dynamic-content table in PowerShell using the PowerHTML module


I'm getting an error when I try to read the contents of a table on the web page referenced in the script. Can anyone please help me with a solution? Thanks.

@mklement0, thanks for the detailed explanation. With your help, I was able to extract the table information. However, I'm still unable to extract the table rows; they are still returned as null. Can you please help? Please see below. Thanks.
 
$wc = New-Object System.Net.WebClient
$res = $wc.DownloadString('https://datatables.net/examples/data_sources/ajax.html')
$html = ConvertFrom-Html -Content $res

$ScrapeData=[System.Collections.ArrayList]::new()
$ScrapeData+=$n
$table = $html.SelectNodes('//table') | Where-Object { $_.HasClass("display") -or $_.HasClass("dataTable")}

foreach ($row in $table.SelectNodes('//tr') | Where-Object { $_.HasClass("odd") -or $_.HasClass("even")} )
{
    $cnt += 1

    if ($cnt -eq 1) { continue }

    #$name= $row.SelectSingleNode('//th').innerText.Trim() | Where-Object { $_.HasClass('sorting_1')}
    $value=$row.SelectSingleNode('td').innerText.Trim() -replace "\?", " "
    $new_obj = New-Object -TypeName psobject
    $new_obj | Add-Member -MemberType NoteProperty -Value $value
    $ScrapeData+=$new_obj 
}

Write-Output 'Extracted Table Information'
$table
 
Write-Output 'Extracted Book Details Parsed from HTML table'
$ScrapeData

Extracted data as below


Solution

    • The fundamental problem is that the System.Net.WebClient class, via its .DownloadString() method, as well as PowerShell's web cmdlets - Invoke-WebRequest and Invoke-RestMethod - can only ever retrieve static HTML source code, not dynamically rendered HTML.[1]

      • To support extracting content loaded dynamically - via scripts embedded in the source code that only execute when the page is rendered in a browser - you need a full web browser that you can control programmatically - see this answer.

      • Interactively, most browsers offer:

        • a view of the static HTML source code of a page by right-clicking and selecting a shortcut-menu command such as View Page Source or Show Page Source.

        • a view of the dynamically generated HTML by right-clicking and selecting a command such as Inspect or Inspect Element.
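The difference between the two views can be checked in code. The following is a minimal sketch, not the page's real source: the markup in $staticSource is a hypothetical stand-in for what .DownloadString() returns, and the built-in [xml] cast is substituted for ConvertFrom-Html only so that the snippet runs without any module or network access.

```powershell
# Hypothetical stand-in for the static source returned by .DownloadString():
# the <table> is present, but its <tbody> stays empty until a script fills it.
$staticSource = @'
<html>
  <body>
    <table id="example" class="display">
      <thead><tr><th>Name</th><th>Position</th></tr></thead>
      <tbody></tbody>
    </table>
  </body>
</html>
'@

# The sample is well-formed, so the built-in [xml] cast is sufficient here.
# Real-world HTML usually needs a tolerant parser such as ConvertFrom-Html.
$doc = [xml] $staticSource

$table    = $doc.SelectSingleNode('//table')
$bodyRows = $doc.SelectNodes('//table/tbody/tr')

$tableClass   = $table.GetAttribute('class')
$bodyRowCount = $bodyRows.Count

"class attribute in static source: $tableClass"
"data rows in static source:       $bodyRowCount"
```

With this sample, the row count is 0 and the only class is display, which is the situation any purely static download ends up in when the rows are added by a script.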

    • The immediate problems are [update: the cited code is from the original form of the question]:

      • Note:

        • The following assumes use of the ConvertFrom-Html cmdlet from the PSParseHTML module, whose use your code suggests.

          • By default, it returns HtmlAgilityPack.HtmlNode instances from the HtmlAgilityPack .NET library.
        • While the explanations below point to solutions, they are hypothetical, because they require the dynamically generated HTML to operate on, which, as noted, .DownloadString() cannot provide.
          Specifically, the static source code doesn't contain any table rows - they are populated dynamically; also, the <table> element there has only one class, display.

      • $_.HasClass('display dataTable') looks for a single class name literally named display dataTable, whereas class="display dataTable" in the dynamically generated HTML means that the element has two classes, display and dataTable. Therefore your method call always returns $false.

        • As a result, the $table = ... assignment ends up as $null, which then predictably causes any attempt to call a method on it to fail. Specifically, $table.SelectNodes('//tr') results in the error "You cannot call a method on a null-valued expression."

        • The logic you were looking for is probably to find elements that have class display as well as class dataTable, which requires $_.HasClass("display") -and $_.HasClass("dataTable").

      • $_.HasClass("odd", "even") would have become a problem, because the method only accepts a single string.

        • The logic you were looking for is probably to find elements with class odd or class even, which requires $_.HasClass("odd") -or $_.HasClass("even").
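The class-matching points above can be illustrated without the HTML parser. In this sketch, Test-HasClass is a hypothetical helper, not part of HtmlAgilityPack: it only mimics a single-name class test by splitting the class attribute on whitespace. The last lines show one way to guard against a $null table before calling methods on it.

```powershell
# Hypothetical helper that mimics a single-name class test: the class
# attribute is split on whitespace and compared name by name.
function Test-HasClass {
    param(
        [string] $ClassAttribute,
        [string] $Name
    )
    # -ccontains is the case-sensitive membership operator.
    ($ClassAttribute -split '\s+') -ccontains $Name
}

$tableClasses = 'display dataTable'   # as in the dynamically generated HTML

# One literal name containing a space never matches.
$literal = Test-HasClass $tableClasses 'display dataTable'    # False

# Two separate tests combined with -and express "has both classes".
$both = (Test-HasClass $tableClasses 'display') -and
        (Test-HasClass $tableClasses 'dataTable')             # True

# Two separate tests combined with -or express "has either class".
$rowClasses = 'even'
$oddOrEven = (Test-HasClass $rowClasses 'odd') -or
             (Test-HasClass $rowClasses 'even')               # True

"literal 'display dataTable': $literal"
"display -and dataTable:      $both"
"odd -or even:                $oddOrEven"

# Guard against a filter that matched nothing before calling methods.
$table = $null
if ($null -eq $table) {
    $message = 'No matching <table> element; skipping row extraction.'
} else {
    $message = "Found $($table.SelectNodes('.//tr').Count) row(s)."
}
$message
```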

    [1] In PowerShell's legacy, ships-with-Windows, Windows-only edition, Windows PowerShell (whose latest and last version is v5.1) - as opposed to the modern, cross-platform PowerShell (Core) 7+ edition - you may still be able to use built-in features to access dynamic content, but - given that these features rely on the long-obsolete Internet Explorer - this will work with fewer and fewer websites over time.
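
Putting the points above together, here is a sketch of how the row extraction could look once the dynamically rendered HTML is available. Everything in $renderedHtml is hypothetical sample data (for instance, as it might be copied from the browser's Inspect view), and the [xml] cast again stands in for ConvertFrom-Html so that the snippet is self-contained; with HtmlAgilityPack nodes, the same structure would use HasClass() instead of the manual class test.

```powershell
# Hypothetical stand-in for the dynamically rendered HTML; sample data only.
$renderedHtml = @'
<html><body>
  <table id="example" class="display dataTable">
    <thead><tr><th>Name</th><th>Position</th></tr></thead>
    <tbody>
      <tr class="odd"><td>Airi Satou</td><td>Accountant</td></tr>
      <tr class="even"><td>Angelica Ramos</td><td>Chief Executive Officer (CEO)</td></tr>
    </tbody>
  </table>
</body></html>
'@

$doc   = [xml] $renderedHtml
$table = $doc.SelectSingleNode('//table')

# './/tr' searches below $table only; a leading '//' would search the
# whole document, regardless of which node the method is called on.
$scrapeData = foreach ($row in $table.SelectNodes('.//tr')) {
    # Skip rows (such as the header row) that are neither odd nor even.
    $classes = $row.GetAttribute('class') -split '\s+'
    if ($classes -notcontains 'odd' -and $classes -notcontains 'even') { continue }

    # Build one object per row, with a named property for each cell.
    $cells = $row.SelectNodes('td')
    [pscustomobject] @{
        Name     = $cells[0].InnerText.Trim()
        Position = $cells[1].InnerText.Trim()
    }
}

$scrapeData
```

Collecting the loop's output directly in $scrapeData avoids building the list with +=, and [pscustomobject] gives every value a property name, which Add-Member -MemberType NoteProperty also requires via its -Name parameter.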