Search code examples
vbaweb-scrapingxmlhttprequestserverxmlhttp

Macro gets partial response using serverxmlhttp requests


I'm trying to extract street address along with the builder name from a webpage. When I use xmlhttp60 requests I get those fields accordingly. However, when I go for serverxmlhttp60 requests I get partial response most of the times and as a result the script only prints the street adddress. I used json converter to parse builder name out of json content from that site.

Here is the proof of concept:

Sub GrabPropertyInfo()
    Const siteLink$ = "https://www.redfin.com/TX/Austin/604-Amesbury-Ln-78752/unit-2/home/171045975"

    Dim oPost As Object, oData As Object, Html As HTMLDocument
    Dim jsonObject As Object, jsonStr As Object, propertyMainRaw$
    Dim itemStr As Variant, sResp As String, oElem As Object
    Dim propertyContainer As Object, propertyMain As Object
    
    Set Html = New HTMLDocument
    
'    With CreateObject("MSXML2.XMLHTTP")
'        .Open "GET", siteLink, False
'        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
'        .send
'        sResp = .responseText
'        Html.body.innerHTML = .responseText
'    End With
    
    With CreateObject("MSXML2.ServerXMLHTTP.6.0")
        .Open "GET", siteLink, True
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
        .send
        While .readyState < 4: DoEvents: Wend
        sResp = .responseText
        Html.body.innerHTML = .responseText
    End With

    Debug.Print "Street address: " & Html.querySelector("h1.homeAddress > .street-address").innerText
    
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "reactServerState\.InitialContext = (.*);"
        .MultiLine = True
        Set jsonStr = .Execute(sResp)
    End With
    

    itemStr = jsonStr(0).submatches(0)
    Set jsonObject = JsonConverter.ParseJson(itemStr)
    Set propertyMain = jsonObject("ReactServerAgent.cache")("dataCache")("/stingray/api/home/details/mainHouseInfoPanelInfo")("res")
    propertyMainRaw = Replace(propertyMain("text"), "{}&&", "")

    On Error Resume Next
    Set propertyContainer = JsonConverter.ParseJson(propertyMainRaw)("payload")("mainHouseInfo")("amenitiesInfo")("superGroups")
    On Error GoTo 0

    If Not propertyContainer Is Nothing Then
        For Each oElem In propertyContainer
            For Each oPost In oElem("amenityGroups")
                If InStr(oPost("groupTitle"), "Building Information") > 0 Then
                    For Each oData In oPost("amenityEntries")
                        If InStr(oData("amenityName"), "Builder Name") > 0 Then
                            Debug.Print "Builder Name: " & oData("amenityValues")(1)
                        End If
                    Next oData
                End If
            Next oPost
        Next oElem
    End If
End Sub

Using xmlhttp requests, I always get:

Street address: 604 Amesbury Ln #2,
Builder Name: Zach Savage

Using serverxmlhttp requests, I get the following result most of the time:

Street address: 604 Amesbury Ln #2,

How can I get complete response using serverxmlhttp requests?

EDIT:

According to the answer and comments, it is clear that if I scrape browserid from that site using xmlhttp requests and use the value of that browserid as cookie while sending requests using serverxmlhttp, I'll get the desired results. However, The problem is the value of the browserid that I get using xmlhttp requests is sz9u0xmCQKKV9Wu0jRa3Yg whereas I can see this value v-J5D2IUSyqXizI7MG67fQ in page source. How can I get the latter value? This is how I parsed browserid.

Sub FetchBrowserId()
    Const siteLink$ = "https://www.redfin.com/TX/Austin/604-Amesbury-Ln-78752/unit-2/home/171045975"
 
    Dim Rxp As Object, browserId As Object, sRes$, cookie$
 
    Set Rxp = CreateObject("VBScript.RegExp")
 
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", siteLink, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        .send
        sRes = .responseText
    End With
 
    With Rxp
        .Global = True
        .Pattern = "window.__rfBrowserId=""(.*?)"";"
        .MultiLine = True
        Set browserId = .Execute(sRes)
    End With
 
    cookie = browserId(0).submatches(0)
    Debug.Print cookie
End Sub

Solution

  • So the actual question I guess is why there are different responses returned by MSXML2.XMLHTTP and MSXML2.ServerXMLHTTP requests made to the same URL.

    MSXML2.XMLHTTP uses WinINet stack and MSXML2.ServerXMLHTTP uses WinHTTP stack. Check WinINet vs. WinHTTP article for more details.

    WinINet provide full processing of cookies (BTW IE also rely on it). So the first reason you have different responses is that cookies sent to server may affect flow. It can be easily compared with any service like e. g. Webhook.site. When you make second request with MSXML2.XMLHTTP the webservice logs the cookies which have been accepted from first response.

    Also take in account SSL conditions. Make requests to How's My SSL? by MSXML2.XMLHTTP and MSXML2.ServerXMLHTTP, and follow the link in browser (i. e. Chrome) and compare results.