Wrapper value extraction using MSXML2.XMLHTTP


We are currently extracting data from a webpage using MSXML2.XMLHTTP. My code extracts all of the data except the text of the element with class rvw-cnt-tx. I want to extract the review count (43 Reviews) from the following URL.

url="https://www.trendyol.com/lc-waikiki/erkek-cocuk-lacivert-takim-p-78215759?boutiqueId=555784&merchantId=4171"

webpage html:

<a href="/lc-waikiki/erkek-cocuk-lacivert-takim-p-78215759/yorumlar?boutiqueId=555784&amp;merchantId=4171&amp;v=11-12-yas" class="rvw-cnt-tx">43 Reviews </a>

My code

Set http = CreateObject("MSXML2.XMLHTTP")                                   
http.Open "GET", url, False                                                 
http.Send                                                                   
html.body.innerHTML = http.ResponseText                                     
html1 = html.body.innerHTML                                                 
brand = html.body.innerText                                                 
Dim reviews As String                                                       
cat = html.getElementsByClassName("breadcrumb full-width")(0).innerText     
reviews = html.getElementsByClassName("rvw-cnt-tx")(0).innerText            

Solution

  • The review count is retrieved dynamically, so it is not in the static response that MSXML2.XMLHTTP receives for the product page. However, you can append /yorumlar to the path of your current URL to reach the reviews page, where the value is present statically. I use a regex to extract the numeric part of the text that holds the review count.

    The html.querySelector(".title h3") call restricts the regex to the text of the one node where that value is present.

    Option Explicit
    
    Public Sub GetReviewCount()
        'tools > references > Microsoft HTML Object Library
        Dim re As Object, html As MSHTML.HTMLDocument, xhr As Object
    
        Set re = CreateObject("VBScript.RegExp")
        Set xhr = CreateObject("MSXML2.XMLHTTP")
        Set html = New MSHTML.HTMLDocument
        re.Pattern = "([0-9,]+)"
        
        With xhr
            .Open "GET", "https://www.trendyol.com/lc-waikiki/erkek-cocuk-lacivert-takim-p-78215759/yorumlar", False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            html.body.innerhtml = .responseText
        End With
        Debug.Print re.Execute(html.querySelector(".title h3").innerText)(0).SubMatches(0)
    End Sub
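
    The product URL in the question carries a query string, so appending /yorumlar to the whole string would place it after the parameters. A minimal sketch of building the reviews URL instead, assuming the query string can simply be dropped, as in the hard-coded URL above. BuildReviewsUrl is a hypothetical helper name, not part of the original code.

    Public Function BuildReviewsUrl(ByVal productUrl As String) As String
        Dim queryPos As Long, basePart As String

        ' Keep only the part before the query string, if there is one
        queryPos = InStr(1, productUrl, "?")
        If queryPos > 0 Then
            basePart = Left$(productUrl, queryPos - 1)
        Else
            basePart = productUrl
        End If

        ' Avoid a double slash when the path already ends with "/"
        If Right$(basePart, 1) = "/" Then basePart = Left$(basePart, Len(basePart) - 1)

        ' Assumption: the reviews page lives at <product path>/yorumlar
        BuildReviewsUrl = basePart & "/yorumlar"
    End Function

    Passing the URL from the question to this function returns the same address that GetReviewCount requests.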
    

    To build your cat variable correctly from the individual breadcrumb items:

    Option Explicit
    
    Public Sub GetCat()
        'tools > references > Microsoft HTML Object Library
        Dim html As MSHTML.HTMLDocument, xhr As Object
    
        Set xhr = CreateObject("MSXML2.XMLHTTP")
        Set html = New MSHTML.HTMLDocument
    
        With xhr
            .Open "GET", "https://www.trendyol.com/lc-waikiki/erkek-cocuk-lacivert-takim-p-78215759?boutiqueId=555784&merchantId=4171", False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            html.body.innerhtml = .responseText
        End With
        
        Dim nodes As Object, cat As String, i As Long
        
        Set nodes = html.querySelectorAll(".breadcrumb .breadcrumb-item")
        For i = 0 To nodes.Length - 1
            cat = cat & IIf(i = nodes.Length - 1, nodes.Item(i).innerText, nodes.Item(i).innerText & " > ")
        Next
        Debug.Print cat
    End Sub
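
    The line in the question reads .innerText straight from getElementsByClassName(...)(0), which raises a run-time error when the element is missing from the static response. A minimal sketch of a guard, assuming the same Microsoft HTML Object Library reference. GetTextOrEmpty, DemoGetTextOrEmpty and the test markup are hypothetical and only illustrate the check.

    Public Function GetTextOrEmpty(ByVal html As MSHTML.HTMLDocument, ByVal cssSelector As String) As String
        Dim node As Object

        ' querySelector yields Nothing when no element matches the selector
        Set node = html.querySelector(cssSelector)

        ' The function result stays an empty string unless a node was found
        If Not node Is Nothing Then GetTextOrEmpty = Trim$(node.innerText)
    End Function

    Public Sub DemoGetTextOrEmpty()
        Dim html As MSHTML.HTMLDocument
        Set html = New MSHTML.HTMLDocument

        ' Stand-in markup; in practice this would be the xhr response text
        html.body.innerHTML = "<a class=""rvw-cnt-tx"">43 Reviews </a>"

        Debug.Print GetTextOrEmpty(html, ".rvw-cnt-tx")   ' element present
        Debug.Print Len(GetTextOrEmpty(html, ".missing")) ' element absent
    End Sub

    An empty result for a selector that matches in the browser suggests the value is filled in by script after the page loads, which is the situation described for rvw-cnt-tx.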