How can I parse repeated HTML elements by their class attribute?


I'm trying to parse an HTML file in which every table row uses basically the same tags.

I want to get this output:

BTC - Bitcoin, BEP20(BSC), Bitcoin(SegWit)

ETH - ERC20, BEP20(BSC), POLYGON, ARBITRUM, AURORA, METISEVM

USDT - OMNI, TRC20, ERC20, BEP20(BSC), HECO, POLYGON, FTM, AVAX-C, ARBITRUM, METISEVM

QASH - ERC20

Here is a sample of the HTML:

<div data-v-326d86f4="" class="table-box">
   <table data-v-326d86f4="">
      <tr data-v-326d86f4="">
         <td data-v-326d86f4="">BTC</td>
         <td data-v-326d86f4="" class="block-chain">
            <div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">Bitcoin</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
            <div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">Bitcoin</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">Bitcoin(SegWit)</span></div>
         </td>
         <td data-v-326d86f4="">0.001</td>
         <td data-v-326d86f4="">0.002</td>
      </tr>
      <tr data-v-326d86f4="">
         <td data-v-326d86f4="">ETH</td>
         <td data-v-326d86f4="" class="block-chain">
            <div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">ERC20</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
            <div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">ERC20</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">POLYGON</span><span data-v-326d86f4="">ARBITRUM</span><span data-v-326d86f4="">AURORA</span><span data-v-326d86f4="">METISEVM</span></div>
         </td>
         <td data-v-326d86f4="">0.012</td>
         <td data-v-326d86f4="">0.024</td>
      </tr>
      <tr data-v-326d86f4="">
         <td data-v-326d86f4="">USDT</td>
         <td data-v-326d86f4="" class="block-chain">
            <div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">OMNI</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
            <div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">OMNI</span><span data-v-326d86f4="">TRC20</span><span data-v-326d86f4="">ERC20</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">HECO</span><span data-v-326d86f4="">POLYGON</span><span data-v-326d86f4="">FTM</span><span data-v-326d86f4="">AVAX-C</span><span data-v-326d86f4="">ARBITRUM</span><span data-v-326d86f4="">METISEVM</span></div>
         </td>
         <td data-v-326d86f4="">30</td>
         <td data-v-326d86f4="">50</td>
      </tr>
      <tr data-v-326d86f4="">
         <td data-v-326d86f4="">QASH</td>
         <td data-v-326d86f4="" class="block-chain">
            <div data-v-326d86f4="" class="chain_box">
               <span data-v-326d86f4="" class="chain_name">ERC20</span> <!---->
            </div>
            <!---->
         </td>
         <td data-v-326d86f4="">513</td>
         <td data-v-326d86f4="">1026</td>
      </tr>
      <!-- ... -->

I'm using the HtmlAgilityPack library without success:

Dim arqHtml As String = "C:\Users\Mattia\Desktop\ready.html"
Dim myHtml As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
myHtml.Load(arqHtml)
Dim myTable As HtmlAgilityPack.HtmlNode = myHtml.DocumentNode.SelectSingleNode("//table")
Dim myRows As HtmlAgilityPack.HtmlNodeCollection = myTable.SelectNodes("tr")
For Each tmpRow As HtmlAgilityPack.HtmlNode In myRows
    Dim myCells As HtmlAgilityPack.HtmlNodeCollection = tmpRow.SelectNodes("td")
    If myCells IsNot Nothing Then
        Dim myToken As String = myCells(0).InnerText
        Dim mySpans As HtmlAgilityPack.HtmlNodeCollection = myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")
        If mySpans IsNot Nothing Then
            Dim myListBChain As New List(Of String)
            For Each mySpan As HtmlAgilityPack.HtmlNode In mySpans
                RichTextBox1.Text += mySpan.InnerText
            Next
            Dim allItensAsString = String.Join(", ", richtextbox1.text)
        End If
    End If
Next

This produces the following output:

BitcoinBEP20(BSC)Bitcoin(SegWit)ERC20BEP20(BSC)POLYGONARBITRUMAURORAMETISEVMOMNITRC20ERC20BEP20(BSC)HECOPOLYGONFTMAVAX-CARBITRUMMETISEVMEOSBEP20(BSC)ERC20BEP20(BSC)TRC20BEP20(BSC)ZILBEP20(BSC)NEOLEGACYNEON3ERC20POLYGONERC20DAGBEP2BEP20(BSC)FTMAVAX-CERC20BEP20(BSC)ERC20BEP20(BSC)ERC20HECOBEP20(BSC)ERC20HECOERC20POLYGONERC20HECOERC20POLYGONERC20BEP20(BSC)BCHBEP20(BSC)ERC20LOOPPOLYGONBEP20(BSC)FTMAVAX-CMETISEVMERC20TOLERC20METAERC20BEP20(BSC)

How do I make it return the output I want?


Solution

  • Incorporating my comment on the original question: in the last <tr> in the sample...

    <tr data-v-326d86f4="">
        <td data-v-326d86f4="">QASH</td>
        <td data-v-326d86f4="" class="block-chain">
        <div data-v-326d86f4="" class="chain_box">
            <span data-v-326d86f4="" class="chain_name">ERC20</span> <!---->
        </div>
        <!---->
        </td>
        <td data-v-326d86f4="">513</td>
        <td data-v-326d86f4="">1026</td>
    </tr>
    

    ...the second <td> does not contain a <div class="select-list" ... >, so...

    myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")
    

    ...returns Nothing, hence the NullReferenceException.
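
    To see this behavior in isolation, here is a minimal sketch that runs the same XPath against two trimmed-down cells loaded from strings. The module name, the FindSelectListSpans helper, and the inline HTML are hypothetical and only for illustration; it assumes the HtmlAgilityPack package is referenced and its default options are in effect.

    ```vbnet
    Imports System

    Module SelectNodesDemo
        ' Hypothetical helper: returns the select-list <span> nodes of the first
        ' block-chain cell, or Nothing when the cell has no select-list <div>.
        Function FindSelectListSpans(html As String) As HtmlAgilityPack.HtmlNodeCollection
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.LoadHtml(html)

            Dim cell As HtmlAgilityPack.HtmlNode =
                doc.DocumentNode.SelectSingleNode("//td[contains(@class,'block-chain')]")
            If cell Is Nothing Then Return Nothing

            Return cell.SelectNodes("div[contains(@class,'select-list')]/span")
        End Function

        Sub Main()
            ' Shaped like the QASH row: a chain_box but no select-list.
            Dim qashRow As String =
                "<table><tr><td>QASH</td><td class=""block-chain"">" &
                "<div class=""chain_box""><span class=""chain_name"">ERC20</span></div>" &
                "</td></tr></table>"

            ' Shaped like the BTC row: a select-list with two spans.
            Dim btcRow As String =
                "<table><tr><td>BTC</td><td class=""block-chain"">" &
                "<div class=""select-list""><span>Bitcoin</span><span>BEP20(BSC)</span></div>" &
                "</td></tr></table>"

            ' No match yields Nothing, not an empty collection.
            Console.WriteLine(FindSelectListSpans(qashRow) Is Nothing)
            ' A match yields a collection that can be enumerated safely.
            Console.WriteLine(FindSelectListSpans(btcRow).Count)
        End Sub
    End Module
    ```

    Any code that enumerates or indexes the result therefore needs a Nothing check first.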

    As for building the output you want, first test whether such a <div class="select-list" ... > element exists...

    If mySpans Is Nothing Then
    

    If it doesn't, then save the contents of the <div class="chain_box" ... ><span class="chain_name" ... > element...

    Dim chainTextNode As HtmlAgilityPack.HtmlNode = myCells(1).SelectSingleNode(
        "div[contains(@class, 'chain_box')]/span[contains(@class, 'chain_name')]"
    )
    
    chainText = If(chainTextNode Is Nothing OrElse String.IsNullOrWhiteSpace(chainTextNode.InnerText), "(unknown)", chainTextNode.InnerText)
    

    I added a little extra handling in case that element doesn't exist or has no value.

    If there is a <div class="select-list" ... > element, then save the values of its child <span ... > elements separated by commas...

    chainText = String.Join(", ", mySpans.Select(Function(span) span.InnerText))
    ' Alternative: chainText = String.Join(", ", From span In mySpans Select span.InnerText)
    

    Finally, build and append a new line to your text box...

    RichTextBox1.Text &= $"{myToken} - {chainText}{Environment.NewLine}"
    

    The complete code looks like this...

    ' The .Select call below needs System.Linq (a default import in most VB projects).
    Dim arqHtml As String = "C:\Users\Mattia\Desktop\ready.html"
    Dim myHtml As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
    myHtml.Load(arqHtml)
    Dim myTable As HtmlAgilityPack.HtmlNode = myHtml.DocumentNode.SelectSingleNode("//table")
    
    Dim myRows As HtmlAgilityPack.HtmlNodeCollection = myTable.SelectNodes("tr")
    For Each tmpRow As HtmlAgilityPack.HtmlNode In myRows
        Dim myCells As HtmlAgilityPack.HtmlNodeCollection = tmpRow.SelectNodes("td")
        ' Skip rows without a token cell and a block-chain cell.
        If myCells IsNot Nothing AndAlso myCells.Count > 1 Then
            Dim myToken As String = myCells(0).InnerText
            Dim mySpans As HtmlAgilityPack.HtmlNodeCollection = myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")
            Dim chainText As String
    
            If mySpans Is Nothing Then
                ' No select-list: fall back to the single chain name in chain_box.
                Dim chainTextNode As HtmlAgilityPack.HtmlNode = myCells(1).SelectSingleNode(
                    "div[contains(@class, 'chain_box')]/span[contains(@class, 'chain_name')]"
                )
    
                chainText = If(chainTextNode Is Nothing OrElse String.IsNullOrWhiteSpace(chainTextNode.InnerText), "(unknown)", chainTextNode.InnerText)
            Else
                ' select-list present: join every chain name with ", ".
                chainText = String.Join(", ", mySpans.Select(Function(span) span.InnerText))
                ' Alternative: chainText = String.Join(", ", From span In mySpans Select span.InnerText)
            End If
    
            ' One "TOKEN - chains" line per row.
            RichTextBox1.Text &= $"{myToken} - {chainText}{Environment.NewLine}"
        End If
    Next
    

    If you have a very large input HTML file you might consider...

    • Appending each iteration's line to a StringBuilder...
      outputBuilder.Append($"{myToken} - {chainText}{Environment.NewLine}")
      
      ...and then setting RichTextBox1.Text once after the loop...
      RichTextBox1.Text = outputBuilder.ToString()
      
    • (Assuming WinForms) Calling RichTextBox1.SuspendLayout() before the loop and RichTextBox1.ResumeLayout() after the loop

    ...to improve performance. However, with either or both approaches RichTextBox1 won't display any output until the HTML is completely processed.
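
    As a rough sketch of the StringBuilder variant, the loop can be moved into a function that returns the whole report as one string. The module name ChainListSketch and the function BuildChainReport are hypothetical, and the Trim() calls and the cell-count guard are my additions rather than part of the code above; it assumes the HtmlAgilityPack package is referenced.

    ```vbnet
    Imports System
    Imports System.Linq
    Imports System.Text

    Module ChainListSketch
        ' Hypothetical helper: returns one "TOKEN - chain, chain" line per table row.
        Function BuildChainReport(html As String) As String
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.LoadHtml(html)

            Dim outputBuilder As New StringBuilder()
            Dim rows As HtmlAgilityPack.HtmlNodeCollection =
                doc.DocumentNode.SelectNodes("//table/tr")
            If rows Is Nothing Then Return String.Empty

            For Each row As HtmlAgilityPack.HtmlNode In rows
                Dim cells As HtmlAgilityPack.HtmlNodeCollection = row.SelectNodes("td")
                ' Skip rows that lack a token cell and a block-chain cell.
                If cells Is Nothing OrElse cells.Count < 2 Then Continue For

                Dim spans As HtmlAgilityPack.HtmlNodeCollection =
                    cells(1).SelectNodes("div[contains(@class,'select-list')]/span")
                Dim chainText As String

                If spans Is Nothing Then
                    ' No select-list: fall back to the single chain_name span.
                    Dim nameNode As HtmlAgilityPack.HtmlNode = cells(1).SelectSingleNode(
                        "div[contains(@class,'chain_box')]/span[contains(@class,'chain_name')]")
                    chainText = If(nameNode Is Nothing OrElse String.IsNullOrWhiteSpace(nameNode.InnerText),
                                   "(unknown)", nameNode.InnerText.Trim())
                Else
                    chainText = String.Join(", ", spans.Select(Function(span) span.InnerText.Trim()))
                End If

                ' AppendLine adds Environment.NewLine after each row's line.
                outputBuilder.AppendLine($"{cells(0).InnerText.Trim()} - {chainText}")
            Next

            Return outputBuilder.ToString()
        End Function
    End Module
    ```

    The text box is then assigned once, for example RichTextBox1.Text = ChainListSketch.BuildChainReport(System.IO.File.ReadAllText(arqHtml)), so the control is updated a single time instead of once per row.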