Scrape text from a WebBrowser control using HTML Agility Pack (VB.NET)


I want to extract fields/text from the page shown in a WebBrowser control on a Windows Forms form, using HTML Agility Pack. I can already scrape the text in the background, but I want to scrape it from the WebBrowser inside my form.

I tried assigning WebBrowser1.Document to my HtmlDocument variable, but it seems the type cannot be converted.

This is the error I'm encountering:

[screenshot of the error message]

And these are the variable types:

[screenshot of the variable types]

Here's my code.

Imports System
Imports System.Xml
Imports HtmlAgilityPack


Public Class Form1

    Private Sub Form1_load(sender As System.Object, e As EventArgs) Handles MyBase.Load

        WebBrowser1.Navigate(TextBox3.Text)

    End Sub

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

        Dim link As String = TextBox3.Text
        Dim doc As HtmlDocument = New HtmlWeb().Load(link)
        Dim web_document As HtmlDocument = WebBrowser1.Document

        Dim name As HtmlNode = doc.DocumentNode.SelectSingleNode("//*[@id='details']/div[2]/div[2]/div/div[1]/h3")
        ' If the node is found, show its inner text.
        If name IsNot Nothing Then
            TextBox1.Text = name.InnerText.Trim()
        End If

        Dim customer_number As HtmlNode = doc.DocumentNode.SelectSingleNode("//*[@id='details']/div[2]/div[2]/div/div[2]/dl[4]/dd")
        ' If the node is found, show its inner text.
        If customer_number IsNot Nothing Then
            TextBox2.Text = customer_number.InnerText.Trim()
        End If

        MessageBox.Show("Doc variable: " & doc.GetType().ToString() & Environment.NewLine & "web_document variable: " & web_document.GetType().ToString())

    End Sub

    Private Sub WebBrowser1_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted

    End Sub
End Class

Solution

  • The problem is that WebBrowser1.Document returns a System.Windows.Forms.HtmlDocument, which is a different type from HtmlAgilityPack.HtmlDocument. The two classes only share a name, so one cannot be assigned to the other.

    To use HtmlAgilityPack on the page shown in a WebBrowser control, read the control's DocumentText property and load that string into a new HtmlAgilityPack.HtmlDocument instance, like this:

    Dim doc As New HtmlAgilityPack.HtmlDocument()
    doc.LoadHtml(WebBrowser1.DocumentText)
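
To show where those two lines might go, here is a minimal sketch that runs the scrape once the browser control has finished loading. It assumes the form from the question (a WebBrowser1 control plus TextBox1 and TextBox2 declared in the designer file) and reuses the question's XPath expressions. The ReadyState guard is an addition that is not in the original code. It is there because DocumentCompleted can fire more than once on pages that contain frames.

```vbnet
Imports System.Windows.Forms
Imports HtmlAgilityPack

Public Class Form1

    ' Fires when the WebBrowser control finishes loading a document.
    Private Sub WebBrowser1_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        ' Assumption: skip intermediate events (e.g. frames) until the whole page is ready.
        If WebBrowser1.ReadyState <> WebBrowserReadyState.Complete Then Return

        ' Fully qualified so it cannot be confused with System.Windows.Forms.HtmlDocument.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(WebBrowser1.DocumentText)

        ' Same XPath expressions as in the question.
        Dim name As HtmlNode = doc.DocumentNode.SelectSingleNode("//*[@id='details']/div[2]/div[2]/div/div[1]/h3")
        If name IsNot Nothing Then
            TextBox1.Text = name.InnerText.Trim()
        End If

        Dim customerNumber As HtmlNode = doc.DocumentNode.SelectSingleNode("//*[@id='details']/div[2]/div[2]/div/div[2]/dl[4]/dd")
        If customerNumber IsNot Nothing Then
            TextBox2.Text = customerNumber.InnerText.Trim()
        End If
    End Sub

End Class
```

With this in place, the separate HtmlWeb().Load(link) call in Button1_Click is no longer needed for the scrape, because the HTML comes from the page the control has already loaded. One caveat, stated tentatively: DocumentText may not include content that scripts add after the page loads. If fields come back empty, WebBrowser1.Document.Body.OuterHtml could be worth trying as the input to LoadHtml instead.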