
VB.NET: Getting the InnerText of href using HtmlAgilityPack


I have updated my code (thanks, Tim, for helping me learn). It runs now, but it doesn't return the links I want.

Here is my working code:

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Dim webClient As New System.Net.WebClient
        Dim WebSource As String = webClient.DownloadString("http://www.google.com.ph/search?hl=en&as_q=test&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=ctr%3AcountryCA&as_filetype=&as_rights=#as_qdr=all&cr=countryCA&fp=1&hl=en&lr=&q=test&start=20&tbs=ctr:countryCA")

        Dim doc = New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(WebSource)
        Dim links = GetLinks(doc, "test")
        For Each Link In links
            ListBox1.Items.Add(Link.ToString())
        Next
    End Sub


    Public Class Link
        Public Sub New(Uri As Uri, Text As String)
            Me.Uri = Uri
            Me.Text = Text
        End Sub
        Public Property Text As String
        Public Property Uri As Uri

        Public Overrides Function ToString() As String
            ' Return the URI text directly; using it as a String.Format format string would throw if it contained "{" or "}".
            Return If(Uri Is Nothing, "", Uri.ToString())
        End Function
    End Class


    Public Function GetLinks(doc As HtmlAgilityPack.HtmlDocument, linkContains As String) As List(Of Link)
        Dim uri As Uri = Nothing
        Dim linksOnPage = From link In doc.DocumentNode.Descendants()
                          Where link.Name = "a" _
                          AndAlso link.Attributes("href") IsNot Nothing _
                          Let text = link.InnerText.Trim()
                          Let url = link.Attributes("href").Value
                          Where url.IndexOf(linkContains, StringComparison.OrdinalIgnoreCase) >= 0 _
                          AndAlso uri.TryCreate(url, UriKind.Absolute, uri)

        Dim Uris As New List(Of Link)()
        For Each link In linksOnPage
            Uris.Add(New Link(New Uri(link.url, UriKind.Absolute), link.text))
        Next

        Return Uris
    End Function

I am new to HtmlAgilityPack and still learning, so please bear with me.

My Main Goal:

Sample link: http://www.google.com.ph/search?hl=en&as_q=test&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=ctr%3AcountryCA&as_filetype=&as_rights=#as_qdr=all&cr=countryCA&fp=1&hl=en&lr=&q=test&start=20&tbs=ctr:countryCA

Expected output (links that contain the word "test"):

www.copetest.com/
www.testofhumanity.com/
www3.algonquincollege.com/testcentre/
www.lpitest.ca/
testtube.nfb.ca/
www.ieltscanada.ca/testdates.jsp
https://www.awinfosys.com/eassessment/fsa_fieldtest.htm

Solution

  • You should use the href attribute instead. Also note that string comparisons in .NET are case-sensitive by default, hence the StringComparison.OrdinalIgnoreCase argument below:

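    ' htmlDoc is an already loaded HtmlAgilityPack.HtmlDocument.
    ' SelectNodes returns Nothing when no node matches, so check before looping.
    Dim anchors = htmlDoc.DocumentNode.SelectNodes("//a[@href]")
    If anchors IsNot Nothing Then
        For Each link As HtmlNode In anchors
            Dim href = link.Attributes("href").Value
            If href.IndexOf("test", StringComparison.OrdinalIgnoreCase) >= 0 Then
                ListBox1.Items.Add(href)
                ' or, to show the link text instead:
                ' ListBox1.Items.Add(link.InnerText)
            End If
        Next
    End If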
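
    To make the case-sensitivity point concrete, here is a minimal sketch that compares the default check with the case-insensitive one. The module name, the HrefContains helper and the example.com URL are made up for illustration; only String.Contains and String.IndexOf are real .NET calls.

    ```vbnet
    Imports System

    Module CaseInsensitiveDemo
        ' Hypothetical helper: True when href contains term, ignoring case.
        Function HrefContains(href As String, term As String) As Boolean
            Return href IsNot Nothing AndAlso
                   href.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0
        End Function

        Sub Main()
            ' Made-up URL whose only match for "test" is capitalised.
            Dim href As String = "http://www.example.com/TestCentre/"
            Console.WriteLine(href.Contains("test"))        ' case-sensitive check
            Console.WriteLine(HrefContains(href, "test"))   ' case-insensitive check
        End Sub
    End Module
    ```

    The first check misses "TestCentre" because String.Contains compares ordinally; the second one finds it.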
    

    Here is a method that should return all links in a document that have an absolute URL, as a List(Of Link). Link is a custom class with two properties, one for the text and the other for the Uri:

    Public Class Link
        Public Sub New(Uri As Uri, Text As String)
            Me.Uri = Uri
            Me.Text = Text
        End Sub
        Public Property Text As String
        Public Property Uri As Uri
    
        Public Overrides Function ToString() As String
            Return String.Format("{0} [{1}]", Text, If(Uri Is Nothing, "", Uri.ToString()))
        End Function
    End Class
    
    Public Function GetLinks(doc As HtmlAgilityPack.HtmlDocument) As List(Of Link)
        Dim uri As Uri = Nothing
        Dim linksOnPage = From link In doc.DocumentNode.Descendants()
                          Where link.Name = "a" _
                          AndAlso link.Attributes("href") IsNot Nothing _
                          Let text = link.InnerText.Trim()
                          Let url = link.Attributes("href").Value
                          Where uri.TryCreate(url, UriKind.Absolute, uri)
    
        Dim Uris As New List(Of Link)()
        For Each link In linksOnPage
            Uris.Add(New Link(New Uri(link.url, UriKind.Absolute), link.text))
        Next
    
        Return Uris
    End Function
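
    One thing worth knowing about the Where clause above: Uri.TryCreate with UriKind.Absolute only succeeds for hrefs that are complete URLs, so relative hrefs are silently skipped. A small sketch of that behaviour follows; the IsAbsoluteHref helper and the sample hrefs are assumptions for illustration, not part of the original code.

    ```vbnet
    Imports System

    Module AbsoluteHrefDemo
        ' Hypothetical helper mirroring the Where clause:
        ' True only when href parses as an absolute URI.
        Function IsAbsoluteHref(href As String) As Boolean
            Dim parsed As Uri = Nothing
            Return Uri.TryCreate(href, UriKind.Absolute, parsed)
        End Function

        Sub Main()
            ' Made-up sample hrefs, not taken from the page in the question.
            For Each href As String In New String() {"http://www.example.com/test", "results/page2.html"}
                Console.WriteLine("{0} -> {1}", href, IsAbsoluteHref(href))
            Next
        End Sub
    End Module
    ```

    The first href should be accepted and the second rejected, so if a page mostly uses relative hrefs, the returned list can be much shorter than the number of anchors on the page.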
    

    Here is the requested overload that checks whether a URL contains a given text:

    Public Function GetLinks(doc As HtmlAgilityPack.HtmlDocument, linkContains As String) As List(Of Link)
        Dim uri As Uri = Nothing
        Dim linksOnPage = From link In doc.DocumentNode.Descendants()
                          Where link.Name = "a" _
                          AndAlso link.Attributes("href") IsNot Nothing _
                          Let text = link.InnerText.Trim()
                          Let url = link.Attributes("href").Value
                          Where url.IndexOf(linkContains, StringComparison.OrdinalIgnoreCase) >= 0 _
                          AndAlso uri.TryCreate(url, UriKind.Absolute, uri)
    
        Dim Uris As New List(Of Link)()
        For Each link In linksOnPage
            Uris.Add(New Link(New Uri(link.url, UriKind.Absolute), link.text))
        Next
    
        Return Uris
    End Function
    

    Edit: now tested and working. Use it in the following way:

    Dim site = System.IO.File.ReadAllText("C:\Temp\website_test.htm")
    Dim doc = New HtmlAgilityPack.HtmlDocument()
    doc.LoadHtml(site)
    Dim links = GetLinks(doc)
    For Each Link In links
        ListBox1.Items.Add(Link.ToString())
    Next
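
    If you want to try the filtering without a saved page, the following sketch runs the same kind of check against an inline HTML string. It assumes the HtmlAgilityPack package is referenced; the FindHrefs helper, the module name and the markup are made up for illustration.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports HtmlAgilityPack

    Module InlineHtmlDemo
        ' Hypothetical helper: hrefs of all <a> elements whose href contains term.
        Function FindHrefs(html As String, term As String) As List(Of String)
            Dim result As New List(Of String)()
            Dim doc As New HtmlDocument()
            doc.LoadHtml(html)

            ' SelectNodes returns Nothing when nothing matches.
            Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
            If anchors Is Nothing Then Return result

            For Each a As HtmlNode In anchors
                Dim href As String = a.GetAttributeValue("href", String.Empty)
                If href.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0 Then
                    result.Add(href)
                End If
            Next
            Return result
        End Function

        Sub Main()
            ' Made-up markup standing in for a downloaded page.
            Dim html As String =
                "<a href=""http://www.example.com/TestCentre/"">Test centre</a>" &
                "<a href=""http://www.example.org/about"">About</a>"
            For Each href As String In FindHrefs(html, "test")
                Console.WriteLine(href)
            Next
        End Sub
    End Module
    ```

    Returning a plain list keeps the sketch independent of the ListBox, so the same helper can be called from a button handler or a console test.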