Search code examples
htmlpowershellwebscriptinginvoke-webrequest

Fetching Table from website | web scrapping - Powershell Invoke-webrequest


Trying to fetch the table of content using h3 tag.

link- https://learn.microsoft.com/en-us/troubleshoot/sql/releases/download-and-install-latest-updates

Trying to fetch the whole table of 'SQL Server 2019'. and convrting into json object in powershell.

| Build number or version | Service pack | Update | ..
|        15.0.4298.1      |     None     |  CU19  | ..
|        15.0.4280.7      |     None     |  CU18 + GDR | ..
...

...........................Tried this

` Send a request to the URL and get the HTML content using basic parsing

$response = Invoke-WebRequest -Uri "https://learn.microsoft.com/en-us/troubleshoot/sql/releases/download-and-install-latest-updates" -UseBasicParsing
$response = $response.content

"Find the table which has an h3 tag containing "sql-server-2017""

$table = ($response.ParsedHtml).getElementsByTagName("h3") | Where-Object {$_.innerText -eq "SQL Server 2017"} | Select-Object -ExpandProperty parentNode | Select-Object -ExpandProperty nextSibling

Print the table content

Write-Output $table.outerHTML

but not working


Solution

  • The -UseBasicParsing parameter means you won't have any ParsedHtml data. This is the default on powershell 6+. You can confirm by just inspecting the property

    $response = Invoke-WebRequest -Uri "https://learn.microsoft.com/en-us/troubleshoot/sql/releases/download-and-install-latest-updates" -UseBasicParsing
    $response
    

    enter image description here

    You can use a HTML parser with the html data you retrieve. There are even modules made just for this.

    You can also use regular expressions for a quick and dirty solution. This is not recommended as it's cumbersome and fragile as the slightest change in the website can completely break your solution. Here is an example of a quick, and very dirty regex extraction.

    $tablepattern = '(?s)(?<=/table>\r?\n)<h3.+?sql-server-2019.+?(?=<h3)'
    $datapattern = '(?s)<tr.+?td>(?<Version>.+?)<.+?td>(?<ServicePack>[^<]+?)</.+?>(?<Update>[^<]+?)</.+?href="(?<KBArticle>.+?)".+?>(?<KBNumber>[^<]+?)<.+?<td>(?<ReleaseDate>.+?)</'
    
    $response = Invoke-WebRequest -Uri "https://learn.microsoft.com/en-us/troubleshoot/sql/releases/download-and-install-latest-updates" -UseBasicParsing
    $null = $response.RawContent -match $tablepattern
    
    [string]$table = $matches.Values
    
    $data = $table -split '(?=<tr>)' | ForEach-Object{
        if($_ -match $datapattern){
            $matches.Remove(0)
            [PSCustomObject]$matches
        }
    }
    
    $data | Format-Table
    
    KBNumber Version      ReleaseDate        ServicePack KBArticle                                  Update    
    -------- -------      -----------        ----------- ---------                                  ------    
    5023049  15.0.4298.1  February 16, 2023  None        https://support.microsoft.com/help/5023049 CU19      
    5021124  15.0.4280.7  February 14, 2023  None        https://support.microsoft.com/help/5021124 CU18 + GDR
    5021125  15.0.2101.7  February 14, 2023  None        https://support.microsoft.com/help/5021125 GDR       
    5017593  15.0.4261.1  September 28, 2022 None        https://support.microsoft.com/help/5017593 CU18      
    5016394  15.0.4249.2  August 11, 2022    None        https://support.microsoft.com/help/5016394 CU17      
    5014353  15.0.4236.7  June 14, 2022      None        https://support.microsoft.com/help/5014353 CU16 + GDR
    ...
    

    All that's left is to convert it to json

    $data | ConvertTo-Json