Search code examples
excelvbaweb-scrapingxmlhttprequest

VBA - XMLHTTP and WinHttp request speed


Below are declared variables for 3 requests which I implement in my macros. I listed libraries they use and their late bindings in comments:

Dim XMLHTTP As New MSXML2.XMLHTTP 'Microsoft XML, v6.0 'Set XMLHTTP = CreateObject("MSXML2.XMLHTTP.6.0")
Dim ServerXMLHTTP As New MSXML2.ServerXMLHTTP 'Microsoft XML, v6.0 'Set ServerXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.6.0")
Dim http As New WinHttpRequest 'Microsoft WinHttp Services, version 5.1 'Set http = CreateObject("WinHttp.WinHttpRequest.5.1")

I have a few old web scraping macros which used Internet Explorer automation. I wanted to clean coding and speed them up with these requests.

Unfortunately what I have noticed, MSXML2.ServerXMLHTTP and WinHttpRequest are slower on online store's 20 products test (34 and 35 sec) than IE automation with pictures and active scripting off (24 sec)! MSXML2.XMLHTTP executes in 18 secs. I used to see situations when some out of these 3 requests are 2-3 times faster / slower than the other ones, so I always test which one performs best, but never before had any request lost to IE automation.

The main page with results is below, it's all results on one page, 1500+ of them, so request takes some time (6500 pages if pasted to MS Word):

www.justbats.com/products/bat type~baseball/?sortBy=TotalSales Descending&page=1&size=2400

Then I open individual links from main result page:

http://www.justbats.com/product/2017-marucci-cat-7-bbcor-baseball-bat--mcbc7/24317/

I would like to know if these 3 requests are all options I have to get data from websites without browser automation. Also - how possibly browser automation can beat some of these requests?

UPDATE

I have tested the main result page with procedure provided in answer by Robin Mackenzie, clearing IE cache before running it. At least on this particular page, caching seemed to have no explicit gain, as subsequent requests yielded a similar result. IE had active scripting disabled and no images loading.

IE automation method, Document length: 7593346 chars, Processed in: 8 seconds

WinHTTP method,  Document length: 7824059 chars, Processed in: 29 seconds

XML HTTP method, Document length: 7830217 chars, Processed in: 4 seconds

Server XML HTTP method, Document length: 7823958 chars, Processed in: 26 seconds

URL download file method, Document length: 7830346 chars, Processed in: 7 seconds

Very surprising for me is the difference in amount of characters returned by these methods.


Solution

  • In addition to the methods you've mentioned:

    • IE automation
    • WinHTTPRequest
    • XMLHTTP
    • ServerXMLHTTP

    There are 2 other methods you can think about:

    • Using the CreateDocumentFromUrl method of the MSHTML.HTMLDocument object
    • Using the Windows API function URLDownloadToFileA

    There are some other Windows APIs that I am ignoring such as InternetOpen, InternetOpenUrl etc as potential performance will be outweighed by complexity of guess the response length, buffering the response, and so forth.

    CreateDocumentFromUrl

    With the CreateDocumentFromUrl method it is a problem with your sample website because it attempts to create a HTMLDocument in an frame which is not allowed with errors such as:

    Framing Forbidden

    and

    To help protect the security of information you enter into this website, the publisher of this content does not allow it to be displayed in a frame.

    So we should not use this method.

    URLDownloadToFileA

    I thought you need the equivalent of file_get_contents and found this method. It is easily used (check this link) and out-performs the other methods when used on a large request (e.g. try it when you go for >2000 baseball bats). The XMLHTTP also method uses the URLMon library so I guess this way is just cutting out a bit of middle-man logic and obviously there's a downside because you have to do some file system handling.

    Private Declare Function URLDownloadToFileA Lib "urlmon" (ByVal pCaller As Long, _
        ByVal szURL As String, ByVal szFileName As String, ByVal dwReserved As Long, _
        ByVal lpfnCB As Long) As Long
    
    Sub TestUrlDownloadFile(strUrl As String)
    
        Dim dteStart As Date
        Dim dteFinish As Date
        Dim strTempFileName As String
        Dim strResponse As String
        Dim objFso As FileSystemObject
    
        On Error GoTo ExitFunction
    
        dteStart = Now
        strTempFileName = "D:\foo.txt"
        DownloadFile strUrl, strTempFileName
        Set objFso = New FileSystemObject
        With objFso.OpenTextFile(strTempFileName, ForReading)
            strResponse = .ReadAll
            .Close
        End With
        objFso.DeleteFile strTempFileName
        dteFinish = Now
    
        Debug.Print "URL download file method"
        Debug.Print "Document length: " & Len(strResponse) & " chars"
        Debug.Print "Processed in: " & Format(dteFinish - dteStart, "s") & " seconds"
        Debug.Print VBA.vbNewLine
    
    ExitFunction:
        If Err.Number <> 0 Then
            Debug.Print Err.Description
        End If
    
    End Sub
    
    'http://www.vbaexpress.com/forum/archive/index.php/t-27050.html
    Private Function DownloadFile(URL As String, LocalFilename As String) As Boolean
      Dim lngRetVal As Long
      lngRetVal = URLDownloadToFileA(0, URL, LocalFilename, 0, 0)
      If lngRetVal = 0 Then DownloadFile = True
    End Function
    

    With the URLDownloadToFileA it is taking me about 1-2 seconds to download you sample URL versus 4-5 seconds with the XMLHTTP method (full code below).

    The URL:

    www.justbats.com/products/bat type~baseball/?sortBy=TotalSales Descending&page=1&size=2400

    This is the output:

    Testing...
    
    
    XML HTTP method
    Document length: 7869753 chars
    Processed in: 4 seconds
    
    
    URL download file method
    Document length: 7869753 chars
    Processed in: 1 seconds
    

    Code

    This includes all methods discussed e.g. IE automation, WinHTTPRequest, XMLHTTP, ServerXMLHTTP, CreateDocumentFromURL and URLDownloadFile.

    You need all these references in project:

    enter image description here

    Here it is:

    Option Explicit
    
    Private Declare Function URLDownloadToFileA Lib "urlmon" (ByVal pCaller As Long, _
        ByVal szURL As String, ByVal szFileName As String, ByVal dwReserved As Long, _
        ByVal lpfnCB As Long) As Long
    
    Sub Test()
    
        Dim strUrl As String
    
        strUrl = "http://www.justbats.com/products/bat type~baseball/?sortBy=TotalSales Descending&page=1&size=2400"
    
        Debug.Print "Testing..."
        Debug.Print VBA.vbNewLine
    
        'TestIE strUrl
        'TestWinHHTP strUrl
        TestXMLHTTP strUrl
        'TestServerXMLHTTP strUrl
        'TestCreateDocumentFromUrl strUrl
        TestUrlDownloadFile strUrl
    
    End Sub
    
    Sub TestIE(strUrl As String)
    
        Dim dteStart As Date
        Dim dteFinish As Date
        Dim objIe As InternetExplorer
        Dim objHtml As MSHTML.HTMLDocument
        Dim strResponse As String
    
        On Error GoTo ExitFunction
    
        dteStart = Now
        Set objIe = New SHDocVw.InternetExplorer
        With objIe
            .navigate strUrl
            .Visible = False
            While .Busy Or .readyState <> READYSTATE_COMPLETE
               DoEvents
            Wend
            Set objHtml = .document
            strResponse = objHtml.DocumentElement.outerHTML
            .Quit
        End With
        dteFinish = Now
    
        Debug.Print "IE automation method"
        Debug.Print "Document length: " & Len(strResponse) & " chars"
        Debug.Print "Processed in: " & Format(dteFinish - dteStart, "s") & " seconds"
        Debug.Print VBA.vbNewLine
    
    ExitFunction:
        If Err.Number <> 0 Then
            Debug.Print Err.Description
        End If
        If Not objIe Is Nothing Then
            objIe.Quit
        End If
        Set objIe = Nothing
    
    End Sub
    
    Sub TestWinHHTP(strUrl As String)
    
        Dim dteStart As Date
        Dim dteFinish As Date
        Dim objHttp As WinHttp.WinHttpRequest
        Dim objDoc As HTMLDocument
        Dim strResponse As String
    
        On Error GoTo ExitFunction
    
        dteStart = Now
        Set objHttp = New WinHttp.WinHttpRequest
        With objHttp
            .Open "get", strUrl, False
            .setRequestHeader "Cache-Control", "no-cache"
            .setRequestHeader "Pragma", "no-cache"
            .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
            .send
            .WaitForResponse
            strResponse = .responseText
        End With
        dteFinish = Now
    
        Debug.Print "WinHTTP method"
        Debug.Print "Document length: " & Len(strResponse) & " chars"
        Debug.Print "Processed in: " & Format(dteFinish - dteStart, "s") & " seconds"
        Debug.Print VBA.vbNewLine
    
    ExitFunction:
        If Err.Number <> 0 Then
            Debug.Print Err.Description
        End If
        Set objDoc = Nothing
        Set objHttp = Nothing
    
    End Sub
    
    Sub TestXMLHTTP(strUrl As String)
    
        Dim dteStart As Date
        Dim dteFinish As Date
        Dim objXhr As MSXML2.XMLHTTP60
        Dim objDoc As MSHTML.HTMLDocument
        Dim strResponse As String
    
        On Error GoTo ExitFunction
    
        dteStart = Now
        Set objXhr = New MSXML2.XMLHTTP60
        With objXhr
            .Open "get", strUrl, False
            .setRequestHeader "Cache-Control", "no-cache"
            .setRequestHeader "Pragma", "no-cache"
            .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
            .send
            While .readyState <> 4
                DoEvents
            Wend
            strResponse = .responseText
        End With
        dteFinish = Now
    
        Debug.Print "XML HTTP method"
        Debug.Print "Document length: " & Len(strResponse) & " chars"
        Debug.Print "Processed in: " & Format(dteFinish - dteStart, "s") & " seconds"
        Debug.Print VBA.vbNewLine
    
    ExitFunction:
        If Err.Number <> 0 Then
            Debug.Print Err.Description
        End If
        Set objDoc = Nothing
        Set objXhr = Nothing
    
    End Sub
    
    Sub TestServerXMLHTTP(strUrl As String)
    
        Dim dteStart As Date
        Dim dteFinish As Date
        Dim objXhr As MSXML2.ServerXMLHTTP60
        Dim objDoc As MSHTML.HTMLDocument
        Dim strResponse As String
    
        On Error GoTo ExitFunction
    
        dteStart = Now
        Set objXhr = New MSXML2.ServerXMLHTTP60
        With objXhr
            .Open "get", strUrl, False
            .setRequestHeader "Cache-Control", "no-cache"
            .setRequestHeader "Pragma", "no-cache"
            .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
            .send
            While .readyState <> 4
                DoEvents
            Wend
            strResponse = .responseText
        End With
        dteFinish = Now
    
        Debug.Print "Server XML HTTP method"
        Debug.Print "Document length: " & Len(strResponse) & " chars"
        Debug.Print "Processed in: " & Format(dteFinish - dteStart, "s") & " seconds"
        Debug.Print VBA.vbNewLine
    
    ExitFunction:
        If Err.Number <> 0 Then
            Debug.Print Err.Description
        End If
        Set objDoc = Nothing
        Set objXhr = Nothing
    
    End Sub
    
    Sub TestUrlDownloadFile(strUrl As String)
    
        Dim dteStart As Date
        Dim dteFinish As Date
        Dim strTempFileName As String
        Dim strResponse As String
        Dim objFso As FileSystemObject
    
        On Error GoTo ExitFunction
    
        dteStart = Now
        strTempFileName = "D:\foo.txt"
        If DownloadFile(strUrl, strTempFileName) Then
            Set objFso = New FileSystemObject
            With objFso.OpenTextFile(strTempFileName, ForReading)
                strResponse = .ReadAll
                .Close
            End With
            objFso.DeleteFile strTempFileName
        Else
            Debug.Print "Error downloading file from URL: " & strUrl
            GoTo ExitFunction
        End If
        dteFinish = Now
    
        Debug.Print "URL download file method"
        Debug.Print "Document length: " & Len(strResponse) & " chars"
        Debug.Print "Processed in: " & Format(dteFinish - dteStart, "s") & " seconds"
        Debug.Print VBA.vbNewLine
    
    ExitFunction:
        If Err.Number <> 0 Then
            Debug.Print Err.Description
        End If
    
    End Sub
    
    'http://www.vbaexpress.com/forum/archive/index.php/t-27050.html
    Private Function DownloadFile(URL As String, LocalFilename As String) As Boolean
        Dim lngRetVal As Long
        lngRetVal = URLDownloadToFileA(0, URL, LocalFilename, 0, 0)
        If lngRetVal = 0 Then
            DownloadFile = True
        Else
            DownloadFile = False
        End If
    End Function
    
    Sub TestCreateDocumentFromUrl(strUrl As String)
    
        Dim dteStart As Date
        Dim dteFinish As Date
        Dim strResponse As String
        Dim objDoc1 As HTMLDocument
        Dim objDoc2 As HTMLDocument
    
        On Error GoTo ExitFunction
    
        dteStart = Now
        Set objDoc1 = New HTMLDocument
        Set objDoc2 = objDoc1.createDocumentFromUrl(strUrl, "null")
        While objDoc2.readyState <> "complete"
            DoEvents
        Wend
        strResponse = objDoc2.DocumentElement.outerHTML
        Debug.Print strResponse
        dteFinish = Now
    
        Debug.Print "HTML Document Create from URL method"
        Debug.Print "Document length: " & Len(strResponse) & " chars"
        Debug.Print "Processed in: " & Format(dteFinish - dteStart, "s") & " seconds"
        Debug.Print VBA.vbNewLine
    
    ExitFunction:
        If Err.Number <> 0 Then
            Debug.Print Err.Description
        End If
        Set objDoc2 = Nothing
        Set objDoc1 = Nothing
    
    End Sub