Parsing <div> HTML content with &nbsp;

I have the monitoring page output below, which I am trying to parse into a variable.

<style type="text/css"></style>
<div style="float:left;margin-right:50px">

<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC1 NY [ENABLED]
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Zone : BW Zone 1[1], &nbsp;&nbsp;VIP =</div>

<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 3[3], &nbsp;&nbsp;VIP =
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a>

<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC2 NJ [ENABLED]
&nbsp;[DEFAULT DC]</div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Portal Zone : BW Zone 2[2], &nbsp;&nbsp;VIP =</div>

<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 4[4], &nbsp;&nbsp;VIP =
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=></a>

--> </div>

I would like to parse this to get:

Data Center                    Active Zone   VIP   Local Zone    VIP
DC1 NY [Enabled]               BW Zone 1[1]        LC Zone 3[3]
DC2 NJ [Enabled] [DEFAULT DC]  BW Zone 2[2]        LC Zone 4[4]

My code below does not seem to parse it. Is regex the best way to parse this page, or should I try some other way?

$zone = ""
$html = Invoke-WebRequest -Uri $zone -ErrorAction Stop
$DC = ($html.ParsedHtml.getElementsByTagName('div') |  Where-Object { $_.InnerHTML -like '<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: *' })  |  Foreach-Object {$_.outerText -replace '(?<!:.*):', '='} | %{new-object psobject -prop (ConvertFrom-StringData $_)}


  • For that you could do this:

    $div = $html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>*DataCenter:*' }
    $DC = if ($div -and $div.outerText -match '(?s)DataCenter\s*:\s*(\w+).*Active Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d\.]+)') {
        # build one object from the three capture groups
        [PSCustomObject]@{
            'DataCenter'  = $matches[1]
            'Active Zone' = $matches[2]
            'VIP'         = $matches[3]
        }
    }
    $DC | Format-Table -AutoSize


    DataCenter Active Zone VIP         
    ---------- ----------- ---         
    DC1        BW Zone

    or as List

    $DC | Format-List


    DataCenter  : DC1
    Active Zone : BW Zone
    VIP         :
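To see what each capture group of that pattern picks up without hitting the live page, here is a minimal sketch on a hand-written string. The `$sample` text and its VIP address are assumptions made up for illustration (the post shows the VIPs blank), not real page output. It also shows why the DataCenter column above holds only `DC1`: `(\w+)` stops at the first space.

```powershell
# Hypothetical one-datacenter text, roughly what outerText might return.
# The VIP value is invented for this example.
$sample  = "DataCenter: DC1 NY [ENABLED]`n   Active Zone : BW Zone 1[1],   VIP = 10.0.0.1"
$pattern = '(?s)DataCenter\s*:\s*(\w+).*Active Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d\.]+)'

if ($sample -match $pattern) {
    # -match fills the automatic $matches hashtable; keys 1..3 are the capture groups
    $matches[1]   # DC1  (only the first word, because \w does not match a space)
    $matches[2]   # BW Zone 1[1]
    $matches[3]   # 10.0.0.1
}
```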

    Here's a different approach for when the HTML contains multiple datacenters:

    # use outerText to get the plain text for the surrounding <div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED ...</div>
    $content = ($html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.innerHtml -like '<div>DATA CENTERS*' }).outerText
    $DC = $content -split 'DataCenter\s*:\s*' |
          Where-Object { $_ -match '(?s)([\w ]+(?:[ [\w\]]*)).*Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)' } |
          ForEach-Object {
              # $matches still holds the groups from the -match in Where-Object
              [PSCustomObject]@{
                  'DataCenter'  = $matches[1]
                  'Active Zone' = $matches[2]
                  'VIP'         = $matches[3]
              }
          }
    $DC | Format-Table -AutoSize


    DataCenter                     Active Zone  VIP           
    ----------                     -----------  ---           
    DC1 NY [ENABLED]               BW Zone 1[1]
    DC2 NJ [ENABLED]  [DEFAULT DC] BW Zone 2[2]

    Regex details:

    (?s)                  Match the remainder of the regex with the options: dot matches newline (s)
    (                     Match the regular expression below and capture its match into backreference number 1
       [\w ]              Match a single character present in the list below
                          A word character (letters, digits, etc.)
                          The character “ ”
          +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
       (?:                Match the regular expression below
          [ [\w\]]        Match a single character present in the list below
                          One of the characters “ [”
                          A word character (letters, digits, etc.)
                          A ] character
             *            Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       )
    )
    .                     Match any single character
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    Active\               Match the characters “Active ” literally
    (?:                   Match the regular expression below
       Portal\            Match the characters “Portal ” literally
    )?                    Between zero and one times, as many times as possible, giving back as needed (greedy)
    Zone                  Match the characters “Zone” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    :                     Match the character “:” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    (                     Match the regular expression below and capture its match into backreference number 2
       [^,]               Match any character that is NOT a “,”
          +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    )
    ,                     Match the character “,” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       +                  Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    VIP                   Match the characters “VIP” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    =                     Match the character “=” literally
    \s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
       *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    (                     Match the regular expression below and capture its match into backreference number 3
       [\d.]              Match a single character present in the list below
                          A single digit 0..9
                          The character “.”
          +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    )
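
On the question of whether there is some other way: the sketch below works on the raw HTML string rather than `ParsedHtml`, which as far as I know is only available in Windows PowerShell and not in PowerShell 6 and later. It strips the tags, decodes the `&nbsp;` entities, collapses whitespace, and then uses the same split-on-`DataCenter` idea with named groups that also capture the Local Zone. Everything in `$raw` is a hypothetical, shortened copy of the page with made-up VIP addresses, and `$pattern` is my own variant rather than the pattern explained above. It assumes exactly one local zone per datacenter and allows empty VIPs.

```powershell
# Hypothetical sample mimicking the page; VIP addresses are invented.
# With a live page you would use (Invoke-WebRequest -Uri $zone).Content instead.
$raw = @'
<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC1 NY [ENABLED]
<div><br>&nbsp;&nbsp; Active Zone : BW Zone 1[1], &nbsp;&nbsp;VIP = 10.0.0.1</div>
<div><br>&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp; LC Zone 3[3], &nbsp;&nbsp;VIP = 10.0.0.3
<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC2 NJ [ENABLED]
&nbsp;[DEFAULT DC]</div>
<div><br>&nbsp;&nbsp; Active Portal Zone : BW Zone 2[2], &nbsp;&nbsp;VIP = 10.0.0.2</div>
<div><br>&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp; LC Zone 4[4], &nbsp;&nbsp;VIP = 10.0.0.4
'@

# 1. Replace every tag with a space, 2. decode entities (&nbsp; becomes U+00A0),
# 3. collapse runs of whitespace, including non-breaking spaces, to one space.
$text = $raw -replace '<[^>]+>', ' '
$text = [System.Net.WebUtility]::HtmlDecode($text)
$text = $text -replace '[\s\u00A0]+', ' '

# Named groups: datacenter label, active zone and VIP, local zone and VIP.
# The VIP groups use * so that a blank VIP still yields a row.
$pattern = '^(?<dc>.+?)\s+Active (?:Portal )?Zone\s*:\s*(?<az>[^,]+),\s*VIP\s*=\s*(?<avip>[\d.]*)' +
           '\s*Local Zones\s*:\s*(?<lz>[^,]+),\s*VIP\s*=\s*(?<lvip>[\d.]*)'

$DC = $text -split 'DataCenter\s*:\s*' | ForEach-Object {
    if ($_ -match $pattern) {
        [PSCustomObject]@{
            'DataCenter'  = $matches['dc']
            'Active Zone' = $matches['az']
            'Active VIP'  = $matches['avip']
            'Local Zone'  = $matches['lz']
            'Local VIP'   = $matches['lvip']
        }
    }
}
$DC | Format-Table -AutoSize
```

If a datacenter lists several local zones, the second half of the pattern would need to be applied repeatedly (for example with `[regex]::Matches`) instead of once; that case is not handled here.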