Search code examples
excelvbaweb-scrapingxmlhttprequestqueryselector

How to skip a row in Excel with missing html tag using VBA


There are 15 objects listed on this website, each has a link under the photo. The 6th object has none. When extracting and transferring the content with my code the missing html-href is not skipped and in Excel, 14 hrefs are listed below each other (the 6th cell should remain empty or "no ducument"), but the last cell does (& error because 14<=>15). Unfortunately I have to keep my code structure and just need a loop or condition to complete it. Does anyone have any ideas? Thanks.

My incomplete code:

Public Sub GetData()

    Dim html As New HTMLDocument
    Dim elmt01 As Object, elmt02 As Object
    Dim y As Long
    Dim xURL As String

    Set html = New MSHTML.HTMLDocument
    xURL = "https://immobilienpool.de/suche/immobilien?page=1"
    
With CreateObject("MSXML2.XMLHTTP.6.0")
    .Open "GET", xURL, False
    .send
     html.body.innerHTML = .responseText
End With

Set elmt01 = html.querySelectorAll("li[class*='contentBox']")    '15 items
Set elmt02 = html.querySelectorAll("li a[title*='zusätzliche']") '14 hrefs

For y = 0 To elmt01.Length - 1

  If InStr(elmt02, "pdf") Then  'better: If elmt02 exists in elmt01 then...
    ActiveSheet.Cells(y + 1, 2) = elmt02.Item(y).href
  Else
    ActiveSheet.Cells(y + 1, 2) = "No document"
  End If

Next

End Sub

Solution

  • The following script should solve the issue you are having. I had to modify your code to skip the blank row. I hope you will be able to comply with the current version:

    Public Sub GetData()
        Dim Html As HTMLDocument, HTMLDoc As HTMLDocument
        Dim oPdfLink As Object, xURL As String, I As Long
    
        Set Html = New MSHTML.HTMLDocument
        Set HTMLDoc = New MSHTML.HTMLDocument
        
        xURL = "https://immobilienpool.de/suche/immobilien?page=1"
        
        With CreateObject("MSXML2.XMLHTTP.6.0")
            .Open "GET", xURL, False
            .send
             Html.body.innerHTML = .responseText
        End With
    
        With Html.querySelectorAll("li[class*='contentBox']")
            For I = 0 To .Length - 1
                HTMLDoc.body.innerHTML = .item(I).outerHTML
                Set oPdfLink = HTMLDoc.querySelector("a[title*='zusätzliche']")
                
                If Not oPdfLink Is Nothing Then
                    ActiveSheet.Cells(I + 1, 2) = oPdfLink.href
                Else:
                    ActiveSheet.Cells(I + 1, 2) = "No document"
                End If
            Next I
        End With
    End Sub