Search code examples
powershellpowershell-2.0powershell-3.0powershell-4.0

HTML Table to CSV


I am quite new to powershell, I used this code by JohnLBevan to convert the HTML Table into CSV:

function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader

    )
    process {
        $cols = $htmlTableRow | select -expandproperty td
        if($isHeader.IsPresent) {
            0..($cols.Count - 1) | %{$x=$cols[$_] | out-string; if(($x) -and ($x.Trim() -gt [string]::Empty)) {$x} else {("Column_{0:0000}" -f $_)}} #clean the headers to ensure each col has a name        
        } else {
            $colCount = ($cols | Measure-Object).Count - 1
            $result = new-object -TypeName PSObject
            0..$colCount | %{
                $colName = if($headers[$_]){$headers[$_]}else{("Column_{0:00000} -f $_")} #in case we have more columns than headers 
                $colValue = $cols[$_]
                $result | Add-Member NoteProperty $colName $colValue
            } 
            write-output $result
        }
    }
}

function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        #currently only very basic <table><tr><td>...</td></tr></table> structure supported
        #could be improved to better understand tbody, th, nested tables, etc

        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow

        #remove anything tags that aren't td or tr (simplifies our parsing of the data
        [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>{0}</root>" -f ($htmlTable | select -ExpandProperty innerHTML | %{(($_ | out-string) -replace '(</?t[rdh])[^>]*(/?>)|(?:<[^>]*>)','$1$2') -replace '(</?)(?:th)([^>]*/?>)','$1td$2'})) 
        [string[]]$headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers.Count -gt 0) {
            $cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
        }
    }
}

But whenever I execute it from parsedHTML variable and getting the elementbytagname "table", I got this error:

Cannot convert value "<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>

</root>" to type "System.Xml.XmlDocument". Error: "The 'Tr' start tag on line 16 position 124 does not match the end tag of 'td'. Line 20, position 3."
At line:108 char:9
+         [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp '  ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvalidCastToXmlDocument

I wish someone could help me out. Thanks in advance.

I am trying to work with the external website. This is the HTML Code of the Table:

<table class="organization-admin__table table">
          <thead>
            <tr>
              <th colspan="2">Name</th>
              <th>Email address</th>
              <th>Timezone</th>
              <th>Last logged in</th>
              <th>Actions</th>
            </tr>
          </thead>
          <tbody>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user1 &lt;[email protected]&gt;" title="user1 &lt;[email protected]&gt;">
    <img src="https://portal.website.com/avatar/0fd7f51cee04789c617b1cc973e0b245.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F87d37c%2Ffff%26text%3DTM" alt="user1 &lt;[email protected]&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">user1</a></td>
              <td><a href="mailto:[email protected]">[email protected]</a></td>
              <td>Canada/Eastern</td>
              <td>05 Aug 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user2 &lt;[email protected]&gt;" title="user2 &lt;[email protected]&gt;">
    <img src="https://portal.website.com/avatar/481355c93fa79e47ca56110da63d6da5.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F044f67%2Ffff%26text%3DVS" alt="user2 &lt;[email protected]&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">user2</a></td>
              <td><a href="mailto:[email protected]">[email protected]</a></td>
              <td>Canada/Eastern</td>
              <td>16 Jul 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user3 &lt;[email protected]&gt;" title="user3 &lt;[email protected]&gt;">
    <img src="https://portal.website.com/avatar/450f564aaba30e75fe70dc5f4bbefaf6.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2Fffb61e%2Ffff%26text%3DWP" alt="Wilfred &lt;[email protected]&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">Wilfred</a></td>
              <td><a href="mailto:[email protected]">[email protected]</a></td>
              <td>UTC</td>
              <td>26 Jul 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
          </tbody>
    </table>

Solution

  • As commented, converting to XML has strict rules and when the HTML neglets to write a closing tag </tr> loading it as xml will fail.. Same goes for the <img> tags that have no end tag </img>.

    I do not have the full html you are loading, but perhaps try below function instead:

    function ConvertFrom_HtmlTable {
        # adapted from: https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/
        [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
        param(
            [Parameter(Mandatory = $true, Position = 0)]
            [Microsoft.PowerShell.Commands.HtmlWebResponseObject]$WebRequest,
    
            [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
            [int]$TableIndex = 0,
    
            [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
            [string]$TableId,
    
            [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
            [string]$TableName,
    
            [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
            [string]$TableClassName
        )
    
        # Extract the table out of the web request
        switch ($PSCmdlet.ParameterSetName) {
            'ById'    { $table = $WebRequest.ParsedHtml.getElementByID($TableId) }
            'ByIndex' { $table = @($WebRequest.ParsedHtml.getElementsByTagName('table'))[$TableIndex]}
            'ByName'  { $table = @($WebRequest.ParsedHtml.getElementsByName($TableName))[0] }
            'ByClass' { $table = @($WebRequest.ParsedHtml.getElementsByClassName($TableClassName))[0] }
        }
        if (!$table) {
            Write-Warning "Could not find the given table."
            return $null
        }
    
        # load the System.Web assembly to be able to decode HTML entities
        Add-Type -AssemblyName System.Web
    
        $headers = @()
        # Go through all of the rows in the table
        foreach ($row in $table.Rows) {
            $cells = @($row.Cells)
            # If there is a table header, remember its titles
            if($cells[0].tagName -eq "TH") {
                $i = 0
                $headers = @($cells | ForEach-Object {
                    $i++
                    # decode HTML entities and double-up quotes that the value may contain
                    $th = ([System.Web.HttpUtility]::HtmlDecode($_.InnerText) -replace '"', '""').Trim()
                    # if the table header is empty, create it
                    if ([string]::IsNullOrEmpty($th)) { "H$i" } else { $th }
                })
                # proceed with the next row
                continue
            }
            # if we haven't found any table headers, make up names "H1", "H2", etc.
            if(-not $headers) {
                $headers = @(1..($cells.Count + 2) | ForEach-Object { "H$_" })
            }
    
            # Now go through the cells in the the row. For each, try to find the
            # title that represents that column and create a hashtable mapping those
            # titles to content
            $hash = [Ordered]@{}
            for ($i = 0; $i -lt $cells.Count; $i++) {
                # decode HTML entities and double-up quotes that the value may contain
                $value = ([System.Web.HttpUtility]::HtmlDecode($cells[$i].InnerText) -replace '"', '""').Trim()
                $th = $headers[$i]
                $hash[$th] = $value.Trim()
            }
            # And finally cast that hashtable to a PSCustomObject
            [PSCustomObject]$hash
        }
    }
    

    Call it like this:

    $request = Invoke-WebRequest $uri
    $table = ConvertFrom_HtmlTable -WebRequest $request -TableClassName 'organization-admin__table table'
    

    or use the TableIndex parameter if you know it is the first or xth table in the html, as it apparently has no id or name

    If that succeeds, you can simply write to csv:

    $table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation
    

    From your comment, it seems that you cannot for some reason use Invoke-WebRequest and have to parse using the IE com object.

    Try this version of the function instead:

    function ConvertFrom_HtmlTable {
        [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
        param(
            [Parameter(ValueFromPipeline = $true, Mandatory = $true, Position = 0)]
            [string]$Url,
    
            [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
            [int]$TableIndex = 0,
    
            [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
            [string]$TableId,
    
            [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
            [string]$TableName,
        
            [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
            [string]$TableClassName,
    
            [switch]$FirstRowHasHeaders
        )
    
        $ie = New-Object -ComObject 'InternetExplorer.Application'
        $ie.Visible = $false
        $ie.Silent = $true
        $ie.Navigate($Url)
        # wait for IE to fully load the document
        while($ie.Busy) { Start-Sleep -Milliseconds 100 }
    
        $doc = $ie.Document
    
        switch ($PSCmdlet.ParameterSetName) {
            'ById'    { $table = $doc.IHTMLDocument3_getElementByID($TableId) }
            'ByIndex' { $table = @($doc.IHTMLDocument3_getElementsByTagName('table'))[$TableIndex]}
            'ByName'  { $table = @($doc.IHTMLDocument3_getElementsByName($TableName))[0] }
            'ByClass' { $table = @($doc.IHTMLDocument3_getElementsByClassName($TableClassName))[0] }
        }
    
        if ($table) {
            # Extracting table rows as a collection.
            $tbody = $table.childNodes | Where-Object { $_.tagName -eq "tbody" }
            if ($tbody) {
                $rows = $tbody.childNodes | Where-Object { $_.tagName -eq "tr" }
            }
            else {
                $rows = $table.childNodes | Where-Object { $_.tagName -eq "tr" }
            }
    
            # read or create table headers
            # assume the first row has headers either in <th> or <td> tags
            $firstRow = 1
            $headers = @($rows[0].childNodes | Where-Object { $_.tagName -eq "th" } | Foreach-Object { $_.innerHTML })
    
            if (!($headers)) {
                # there were no <th> tags found, so either use the first row as headers or create from scratch
                $values = @($rows[0].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
                if ($FirstRowHasHeaders) {
                    # the headers are considered to be the values from the first row
                    $headers = $values
                }
                else {
                    # the table has no headers, so dynamically create them
                    $firstRow = 0
                    $headers =  for ($i = 1; $i -le $values.Count; $i++) { "Column_$i" }
                }
            }
    
            # create a List object to store the values found as PSObjects
            $result = [System.Collections.Generic.List[object]]::new()
            for ($i = $firstRow; $i -lt $rows.Count; $i++) {
                $values = @($rows[$i].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
                $valuesCount = $values.Count
                while ($headers.Count -lt $valuesCount) {
                    $colName = "Column_{0}" -f ($headers.Count + 1)
                    $headers += $colName
                    # we have just added a new header column. Make sure the first item also has this new column
                    if ($result.Count) { 
                        $result[0] | Add-Member -MemberType NoteProperty -Name $colName -Value $null
                    }
                }
                # create a Hashtable to get store the values
                $data = [ordered]@{}
                for ($j = 0; $j -lt $valuesCount; $j++) { $data[$headers[$j]] = $values[$j] }
                # add the hash cast to PsCustomObject to the list
                $result.Add(([PsCustomObject]$data))
            }
        }
        else { Write-Warning "Could not find the given table." }
    
        # quit IE and clean up
        $ie.Quit()
        $null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($ie)
        [System.GC]::Collect()
        [System.GC]::WaitForPendingFinalizers()
    
        return $result
    }
    
    $table = ConvertFrom_HtmlTable -Url 'your URL here' -TableClassName 'organization-admin__table'
    $table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation
    

    The 2nd function using the InternetExplorer.Application COM object needs to find the table object using the DOM. For that, the function currently uses the IHTMLDocument3 interface, which for me on Windows 10 Pro, PowerShell 5.1 and IE version 11.789.19041.0 works when I test on for instance

    ConvertFrom_HtmlTable -Url 'https://www.w3schools.com/html/html_tables.asp' -TableId 'customers'
    

    According to your comment, you are receiving error message :

    Method invocation failed because [mshtml.HTMLDocumentClass] does not contain a method named 'IHTMLDocument3_getElementsByClassName'.

    This means you have a different (not updated/ broken) version on your machine and you will have to try for yourself which method works:

    1. test first what version of IE you have by typing this in a PowerShell console:
    (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').SvcVersion
    

    If that returns blank, try

    (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').Version
    
    1. Next, inside the switch change the methods used from
    $doc.IHTMLDocument3_getElementsByClassName($TableClassName)
    

    to

    $doc.getElementsByClassName($TableClassName)
    

    or

    $doc.documentElement.getElementsByClassName($TableClassName)
    

    If all of that fails, I'm afraid you have a serious problem on your computer (maybe the reason for Invoke-Webrequest not working also ?). Try to fix that with fsc /scannow