Tags: powershell, powershell-4.0

Parsing <div> HTML content with &nbsp;


I have the monitoring page output below, which I am trying to parse into a variable.

<html>
<head>
<style type="text/css"></style>
</head>
<body>
<div style="float:left;margin-right:50px">
<div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED:


<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC1 NY [ENABLED]
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Zone : BW Zone 1[1], &nbsp;&nbsp;VIP = 192.168.254.10</div>

<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.10/checkGlobalReplicationTier>https://192.168.254.10/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[ACTIVE]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.10/checkReplication>https://192.168.254.10/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.11/checkGlobalReplicationTier>https://192.168.254.11/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[STANDBY]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.11/checkReplication>https://192.168.254.11/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 3[3], &nbsp;&nbsp;VIP = 192.168.254.13
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.13/checkReplication>https://192.168.254.13/checkReplication</a>
&nbsp;&nbsp;[ACTIVE]</div>


<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC2 NJ [ENABLED]
&nbsp;[DEFAULT DC]</div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Portal Zone : BW Zone 2[2], &nbsp;&nbsp;VIP = 192.168.253.10</div>

<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.10/checkGlobalReplicationTier>https://192.168.253.10/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[ACTIVE]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.10/checkReplication>https://192.168.253.10/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.11/checkGlobalReplicationTier>https://192.168.253.11/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[STANDBY]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.11/checkReplication>https://192.168.253.11/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 4[4], &nbsp;&nbsp;VIP = 192.168.253.13
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.13/checkReplication>https://192.168.253.13/checkReplication</a>
&nbsp;&nbsp;[ACTIVE]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.14/checkReplication>https://192.168.253.14/checkReplication</a>
&nbsp;&nbsp;[STANDBY]</div>



--> </div>
</div>
</body>
</html>

I would like to parse this to get:

Data Center                    Active Zone      VIP             Local Zone   VIP
DC1 NY [Enabled]               BW Zone 1[1]   192.168.254.10  LC Zone 3[3]  192.168.254.13
DC2 NJ [Enabled] [DEFAULT DC]  BW Zone 2[2]   192.168.253.10  LC Zone 4[4]  192.168.253.13 

The code below does not seem to be able to parse it. Is regex the best way to parse this page, or should I try some other way?

$zone = "https://192.168.0.90/checkConfiguration"
$html = Invoke-WebRequest -Uri $zone -ErrorAction Stop
$DC = ($html.ParsedHtml.getElementsByTagName('div') |  Where-Object { $_.InnerHTML -like '<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: *' })  |  Foreach-Object {$_.outerText -replace '(?<!:.*):', '='} | %{new-object psobject -prop (ConvertFrom-StringData $_)}

Solution

  • For that you could do this:

    $div = $html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>*DataCenter:*' }
    $DC = if ($div -and $div.outerText -match '(?s)DataCenter\s*:\s*(\w+).*Active Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d\.]+)') {
        [PsCustomObject]@{
            'DataCenter'  = $matches[1]
            'Active Zone' = $matches[2]
            'VIP'         = $matches[3]
        }
    }
    
    $DC | Format-Table -AutoSize
    

    Output:

    DataCenter Active Zone VIP         
    ---------- ----------- ---         
    DC1        BW Zone     192.168.0.95
    

    Or as a list:

    $DC | Format-List
    

    Output:

    DataCenter  : DC1
    Active Zone : BW Zone
    VIP         : 192.168.0.95
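
A short sketch of why the single-match pattern above stops being enough once the page lists more than one datacenter. The sample string is hypothetical (two datacenters that both use the label "Active Zone") and is not the real page output. `\w+` stops at the first space, so only `DC1` is captured, and the greedy `.*` runs on to the last "Active Zone", which pairs the first datacenter with the second one's zone and VIP.

```powershell
# Hypothetical one-line sample with two datacenters (not the real page output)
$text = 'DataCenter: DC1 NY [ENABLED] Active Zone : BW Zone 1[1],   VIP = 192.168.254.10 ' +
        'DataCenter: DC2 NJ [ENABLED] Active Zone : BW Zone 2[2],   VIP = 192.168.253.10'

# Same pattern as in the single-datacenter snippet
$pattern = '(?s)DataCenter\s*:\s*(\w+).*Active Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d\.]+)'

$mismatch = if ($text -match $pattern) {
    [PSCustomObject]@{
        'DataCenter'  = $Matches[1]   # first datacenter, cut at the first space
        'Active Zone' = $Matches[2]   # zone of the LAST datacenter (greedy .*)
        'VIP'         = $Matches[3]   # VIP of the LAST datacenter
    }
}

$mismatch | Format-List
```

Splitting the text into one chunk per datacenter before matching avoids this mix-up.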
    

    Here's a different approach for when the HTML contains multiple datacenters:

    # use outerText to get the plain text for the surrounding <div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED ...</div>
    $content = ($html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.innerHtml -like '<div>DATA CENTERS*' }).outerText
    $DC = $content -split 'DataCenter\s*:\s*' |
          Where-Object { $_ -match '(?s)([\w ]+(?:[ [\w\]]*)).*Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)' } | 
          ForEach-Object { 
            [PsCustomObject]@{
                'DataCenter'  = $matches[1]
                'Active Zone' = $matches[2]
                'VIP'         = $matches[3]
            }
          }
    
    $DC | Format-Table -AutoSize 
    

    Output:

    DataCenter                     Active Zone  VIP           
    ----------                     -----------  ---           
    DC1 NY [ENABLED]               BW Zone 1[1] 192.168.254.10
    DC2 NJ [ENABLED]  [DEFAULT DC] BW Zone 2[2] 192.168.253.10
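
The question also asks for the local zone and its VIP. The following is a minimal sketch of how the split-then-match idea could be extended to cover them; it is not part of the original answer. The here-string is a hypothetical approximation of what `outerText` returns (the real whitespace may differ), the named groups are my own, and the two VIP columns are named `Active VIP` and `Local VIP` because an object cannot have two properties both called `VIP`. Only the first local zone per datacenter is captured.

```powershell
# Hypothetical sample: an approximation of the outerText of the outer <div>
$content = @'
DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED:

     DataCenter: DC1 NY [ENABLED]
         Active Zone : BW Zone 1[1],   VIP = 192.168.254.10
             https://192.168.254.10/checkGlobalReplicationTier   [ACTIVE]
         Local Zones:
             LC Zone 3[3],   VIP = 192.168.254.13
                 https://192.168.254.13/checkReplication   [ACTIVE]

     DataCenter: DC2 NJ [ENABLED]  [DEFAULT DC]
         Active Portal Zone : BW Zone 2[2],   VIP = 192.168.253.10
         Local Zones:
             LC Zone 4[4],   VIP = 192.168.253.13
'@

# First line of each chunk = datacenter; then the active zone and the first local zone
$pattern = '(?s)^\s*(?<dc>[^\r\n]+).*?Active (?:Portal )?Zone\s*:\s*(?<zone>[^,]+),\s+VIP\s*=\s*(?<vip>[\d.]+)' +
           '.*?Local Zones\s*:\s*(?<lzone>[^,]+),\s+VIP\s*=\s*(?<lvip>[\d.]+)'

$rows = @(
    foreach ($chunk in ($content -split 'DataCenter\s*:\s*')) {
        # The heading chunk has no "Active ... Zone" and is skipped by -match
        if ($chunk -match $pattern) {
            [PSCustomObject]@{
                'Data Center' = $Matches['dc'].Trim()
                'Active Zone' = $Matches['zone'].Trim()
                'Active VIP'  = $Matches['vip']
                'Local Zone'  = $Matches['lzone'].Trim()
                'Local VIP'   = $Matches['lvip']
            }
        }
    }
)

$rows | Format-Table -AutoSize
```

Note that `\s` in .NET regular expressions also matches the non-breaking space that `&nbsp;` decodes to, which is why the patterns do not need special handling for it.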
    

    Regex details:

    (?s)                  Match the remainder of the regex with the options: dot matches newline (s)
    (                     Match the regular expression below and capture its match into backreference number 1
       [\w ]              Match a single character present in the list below
                          A word character (letters, digits, etc.)
                          The character “ ”
          +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
       (?:                Match the regular expression below
          [ [\w\]]        Match a single character present in the list below
                          One of the characters “ [”
                          A word character (letters, digits, etc.)
                          A ] character
             *            Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       )                 
    )                    
    .                     Match any single character
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    Active\               Match the characters “Active ” literally
    (?:                   Match the regular expression below
       Portal\            Match the characters “Portal ” literally
    )?                    Between zero and one times, as many times as possible, giving back as needed (greedy)
    Zone                  Match the characters “Zone” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    :                     Match the character “:” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    (                     Match the regular expression below and capture its match into backreference number 2
       [^,]               Match any character that is NOT a “,”
          +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    )                    
    ,                     Match the character “,” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       +                  Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    VIP                   Match the characters “VIP” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    =                     Match the character “=” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    (                     Match the regular expression below and capture its match into backreference number 3
       [\d.]              Match a single character present in the list below
                          A single digit 0..9
                          The character “.”
          +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    )
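
On the question of whether there is another way: `ParsedHtml` relies on the Internet Explorer engine and is not available in PowerShell 6 and later. A minimal sketch of one alternative, assuming the raw markup is available as a string (for example from `$html.Content`): strip the tags and decode the `&nbsp;` entities yourself, then apply the same split and regex to the resulting text. `ConvertTo-PlainText` is a hypothetical helper name, and the tag stripping is deliberately naive; it is only meant for simple markup like this page.

```powershell
function ConvertTo-PlainText {
    param([string]$Html)

    # Turn <br> and </div> into line breaks so the line structure survives
    $text = $Html -replace '(?i)<br\s*/?>|</div>', "`n"
    # Drop every remaining tag (naive, adequate for simple markup only)
    $text = $text -replace '<[^>]+>', ''
    # Decode entities: &nbsp; becomes the non-breaking space U+00A0
    $text = [System.Net.WebUtility]::HtmlDecode($text)
    # Normalise non-breaking spaces, then collapse runs of spaces and tabs
    $text = $text -replace '\u00A0', ' '
    ($text -replace '[ \t]+', ' ').Trim()
}

# One line taken from the page in the question
$sample = '<div><br>&nbsp;&nbsp; Active Zone : BW Zone 1[1], &nbsp;&nbsp;VIP = 192.168.254.10</div>'
$plain  = ConvertTo-PlainText $sample
$plain
```

The resulting string can then be passed through the same `-split 'DataCenter\s*:\s*'` and `-match` steps as the `outerText` value.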