Tags: vb.net, web, screen-scraping

Scraping specific text from a website into a VB application


I'm trying to create a simple app that compares items across several websites. I've seen ways to pull all of a page's text into the app, but is there a way to extract only specific parts, such as the title and description?

Take a book site as an example. Is there any way to search for a book title and then show the different reviews, synopses, and prices without any irrelevant text?


Solution

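  • A quick and simple solution is to use a WebBrowser control, which exposes the loaded page as an HtmlDocument through its .Document property. The example below reads the page title and selected meta tags once the page has loaded.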

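    ' Assumes a Windows Forms form with a Button (Button1) and a WebBrowser (WebBrowser1).
    Imports System
    Imports System.Windows.Forms

    Public Class Form1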
    
        Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
            Me.WebBrowser1.ScriptErrorsSuppressed = True
            Me.WebBrowser1.Navigate(New Uri("http://stackoverflow.com/"))
        End Sub
    
        ' Raised when a document finishes loading; pages with frames can raise it more than once.
        Private Sub WebBrowser1_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
    
            Dim document As HtmlDocument = Me.WebBrowser1.Document
            Dim title As String = Me.GetTitle(document)
            Dim description As String = Me.GetMeta(document, "description")
            Dim keywords As String = Me.GetMeta(document, "keywords")
            Dim author As String = Me.GetMeta(document, "author")
    
            ' The variables above now hold only the extracted values;
            ' display or store them here as needed.
    
        End Sub
    
        ' Returns the text of the first <title> element in <head>, or an empty string.
        Private Function GetTitle(document As HtmlDocument) As String
            Dim head As HtmlElement = Me.GetHead(document)
            If head IsNot Nothing Then
                For Each el As HtmlElement In head.GetElementsByTagName("title")
                    Return el.InnerText
                Next
            End If
            Return String.Empty
        End Function
    
        ' Returns the content of the first <meta> tag whose name matches (ignoring case),
        ' or an empty string if there is none.
        Private Function GetMeta(document As HtmlDocument, name As String) As String
            Dim head As HtmlElement = Me.GetHead(document)
            If head IsNot Nothing Then
                For Each el As HtmlElement In head.GetElementsByTagName("meta")
                    If (String.Compare(el.GetAttribute("name"), name, True) = 0) Then
                        Return el.GetAttribute("content")
                    End If
                Next
            End If
            Return String.Empty
        End Function
    
        ' Returns the first <head> element, or Nothing if the document has none.
        Private Function GetHead(document As HtmlDocument) As HtmlElement
            For Each el As HtmlElement In document.GetElementsByTagName("head")
                Return el
            Next
            Return Nothing
        End Function
    
    End Class
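
  • The title and meta tags only cover the page header. Reviews, synopses and prices usually live in the page body, so the same idea might be extended by filtering body elements on their CSS class. The following is only a rough sketch under assumptions: the module name ScrapeHelpers and the functions HasClass and GetTextByClass are hypothetical helpers, not part of the answer above, and the tag and class names you pass in depend entirely on the markup of the site being scraped. As far as I know, the WebBrowser control exposes the class attribute under the name "className" rather than "class".

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports System.Windows.Forms

    ' Hypothetical helper module; all names here are illustrative.
    Public Module ScrapeHelpers

        ' True when the space-separated class list contains cssClass (case-sensitive).
        Public Function HasClass(classAttribute As String, cssClass As String) As Boolean
            If String.IsNullOrEmpty(classAttribute) Then Return False
            Dim classes As String() = classAttribute.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)
            Return Array.IndexOf(classes, cssClass) >= 0
        End Function

        ' Collects the trimmed inner text of every <tagName> element carrying cssClass.
        Public Function GetTextByClass(document As HtmlDocument, tagName As String, cssClass As String) As List(Of String)
            Dim results As New List(Of String)()
            If document Is Nothing Then Return results
            For Each el As HtmlElement In document.GetElementsByTagName(tagName)
                ' Assumption: the class attribute is read as "className" in this control.
                If HasClass(el.GetAttribute("className"), cssClass) Then
                    Dim text As String = el.InnerText
                    If Not String.IsNullOrWhiteSpace(text) Then results.Add(text.Trim())
                End If
            Next
            Return results
        End Function

    End Module
    ```

    Inside the DocumentCompleted handler this could be called as GetTextByClass(Me.WebBrowser1.Document, "span", "price"), where "span" and "price" are placeholders for whatever tag and class the target site actually uses. Because every site structures its markup differently, each site being compared would need its own tag and class pair.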