Search code examples
htmlvbaweb-scrapingencodingxmlhttprequest

XHR request response text has unexpected character set


I was looking at the answer, by @OmegaStripes, to this question How to get a particular InnerText from a specific class? Here one uses the Split function, and a specified delimiter string, to extract an href from .responseBody.

I then tried to replicate this to extract the following href :

"https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2018/02/New-AmbSYS-to-2018-Jan.csv" 

from NHS England's Ambulance Quality Indicators

HTML snippet:

<main class="main group" role="main">
        <div class="page-content" id="main-content">
            <header>
                <h1>Ambulance Quality Indicators</h1>
            </header>
            <article class="rich-text">
               <p></p>
              <p></p>
              <p></p>
               <p></p>
              <p></p>
              <p><strong>CSV Data</strong><br>
These files have the same data as other published spreadsheets, but without any formatting:<br>
                <a href="https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2018/02/New-AmbSYS-to-2018-Jan.csv" class="csv-link" onclick="ga('send', 'event', 'Downloads', 'CSV', 'https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2018/02/New-AmbSYS-to-2018-Jan.csv');">New Systems Indicators August 2017 to January 2018 (CSV, 23KB)</a><br>
            </article>
    </div>
</main>

Problem:

I am getting response text back that looks as follows:

Example response text:

Response Text

From a quick bit of research, see references, I am guessing this is perhaps an encoding problem?

I tried setting a .SetRequestHeader

 .setRequestHeader "Content-Type", _
     "application/x-www-form-urlencoded; charset=UTF-8"

This made no difference to the output.

To be honest, I haven't a clue how to resolve this.

Any suggestions please on how I get the expected response text? i.e. that I can parse for the href of interest.

Context:

This is part of a bigger piece of work where:

  1. I want to scrape that CSV link (the name of which will change each month), without having the browser pop-up

  2. Download the target file content

  3. Use ADODB.Stream to write the binary file out.

This process was outlined by @OmegaStripes in response to my question Return focus to ThisWorkbook.Activesheet after XMLHTTP60 file download . I am trying to understand and implement that suggestion currently.

Code:

Option Explicit

Public Const url As String = "https://www.england.nhs.uk/statistics/statistical-work-areas/ambulance-quality-indicators/"
Public aBody As String

Sub Testing()

    ' Download via XHR
    With CreateObject("MSXML2.XMLHTTP")

        .Open "GET", url, False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=utf-8"
        .send
        ' Get binary response content
        aBody = .responseBody
         
    End With
 
    ActiveSheet.Range("A1") = aBody

End Sub

References:

  1. XMLHTTP and Special Characters (eg, accents)

  2. setRequestHeader Method (IXMLHTTPRequest)

  3. VBA HTML Scraping - '.innertext' from complex table

  4. Msxml2.ServerXMLHTTP and UTF-8 charset issues


Solution

  • So credit goes to @FlorentB for this solution and a shout out to @OmegaStripes for the suggestion.

    As suggested, the problem indeed was the .responseBody was returning an array of bytes encoded as UTF-8. As pointed out, I was casting it to a String (UTF-16 encoding) hence all these foreign characters.

    I used @Tomalak's function BytesToString, with minor changes, to handle the conversion to string.

    Code:

    Option Explicit
    
    Public Const url As String = "https://www.england.nhs.uk/statistics/statistical-work-areas/ambulance-quality-indicators/"
    Public aBody As String 'this is causing the conversion
    Const adTypeBinary As Byte = 1
    Const adTypeText As Byte = 2
    Const adModeReadWrite As Byte = 3
    Public Const strPath As String = "C:\Users\User\Desktop\testXMLHTTPOutput"
    
    Public Sub Testing() 
        ' Download via XHR
        With CreateObject("MSXML2.XMLHTTP")
    
            .Open "GET", url, False
            .send
            ' Get binary response content
            aBody = BytesToString(.responseBody, "UTF-8")
    
        End With
    
        Dim fso As Object  'late binding
        Set fso = CreateObject("Scripting.FileSystemObject")
        Dim oFile As Object
        Set oFile = fso.CreateTextFile(strPath)
        oFile.WriteLine aBody
        oFile.Close
        Set fso = Nothing
        Set oFile = Nothing
    
    End Sub
    'ADODB.Stream with stream.CharSet = "UTF-8"
    'http://msdn.microsoft.com/en-us/library/windows/desktop/ms675032%28v=vs.85%29.aspx
    
    
    Public Function BytesToString(ByVal bytes As Variant, ByVal charset As String) As String
    
        With CreateObject("ADODB.Stream")
            .Mode = adModeReadWrite
            .Type = adTypeBinary
            .Open
            .Write bytes
            .Position = 0
            .Type = adTypeText
            .charset = charset
            BytesToString = .ReadText
        End With
    End Function
    

    Useful other link here:

    Save text file UTF-8 encoded with VBA