
Export HTML to text file with different results


I have two procedures that are supposed to export a page's HTML to a text file:

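' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library"
Sub Demo1()
    Dim http        As New XMLHTTP60
    Dim html        As New HTMLDocument

    With http
        ' Synchronous GET: returns the raw response, no scripts are run
        .Open "GET", "https://www.google.com.eg/", False
        .send
        html.body.innerHTML = .responseText

        WriteTxtFile html.body.innerHTML
    End With
End Sub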

Sub WriteTxtFile(ByVal aString As String, Optional ByVal filePath As String = "C:\Users\Future\Desktop\Output.txt")
    Dim fso         As Object
    Dim fileout     As Object

    Set fso = CreateObject("Scripting.FileSystemObject")
    ' Arguments: path, overwrite = True, unicode = True (file is written as Unicode)
    Set fileout = fso.CreateTextFile(filePath, True, True)
    fileout.write aString
    fileout.Close
End Sub

Sub Demo2()
    Dim ie          As Object
    Dim f           As Integer

    Set ie = CreateObject("InternetExplorer.Application")

    With ie
        .Visible = True
        .navigate ("https://www.google.com.eg/")

        ' Wait until the browser reports the page as complete (readyState 4)
        Do: DoEvents: Loop Until .readyState = 4

        ' Print # writes ANSI text, unlike the Unicode file written by WriteTxtFile
        f = FreeFile()
        Open ThisWorkbook.Path & "\Sample.txt" For Output As #f
        Print #f, .document.body.innerHTML
        Close #f

        .Quit
    End With
End Sub

Demo1 writes "Output.txt" and Demo2 writes "Sample.txt", but the two files contain different HTML. Which one is correct, and why do they differ?

Thanks in advance for your help.


Solution

  • XMLHTTP returns the raw response sent by the server; it does not give you the fully rendered content of a web page. Scripts are not executed, so anything rendered via JavaScript is missing.

    Internet Explorer, on the other hand, will render the page fully, provided the browser version supports the JavaScript syntax the page uses. For example, you will run into problems with ES6 (ECMAScript 2015) syntax, as this is not supported on legacy browsers; I believe it is supported on Edge for Windows 10. You can check compatibility tables to see what is and isn’t supported.

    If you familiarize yourself with the dev tools for your browser, you can explore how different parts of a web page are rendered. You can learn to debug scripts and see what changes are made to the DOM and page styling. Often a page will issue XHR requests to update its content, for example. If you want to have a play, see the tools listed under "Exploring page updating" below.

    So, on this basis, I suspect that the first HTML document has less content and a different overall DOM structure from the second.

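    To test for differences caused by the method of writing to the text file, you need to compare apples with apples, i.e. use the same access method and syntax to retrieve the page content before writing it out. Note that the two procedures do write differently: CreateTextFile with its third argument set to True produces a Unicode file, whereas Print # produces ANSI text.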
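
    One way to separate the two effects is to keep the retrieval methods as they are but send both results through a single writer, so any remaining difference comes from the retrieval and not from the file encoding. The following is only a sketch under assumptions: the procedure names WriteUnicodeFile and CompareFetchMethods and the file names "Xhr.txt" and "Ie.txt" are made up for illustration, the workbook is assumed to be saved (so ThisWorkbook.Path is not empty), and the same two library references as in Demo1 are assumed to be set.

```vba
Option Explicit

' Hypothetical helper: writes any string as a Unicode text file,
' so both outputs share one writing method and one encoding.
Private Sub WriteUnicodeFile(ByVal content As String, ByVal filePath As String)
    Dim fso As Object
    Dim fileOut As Object

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set fileOut = fso.CreateTextFile(filePath, True, True)   ' overwrite, Unicode
    fileOut.Write content
    fileOut.Close
End Sub

' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
Public Sub CompareFetchMethods()
    Const PAGE_URL As String = "https://www.google.com.eg/"
    Dim http As New XMLHTTP60
    Dim html As New HTMLDocument
    Dim ie As Object

    ' 1) XMLHTTP: raw server response, no script execution
    http.Open "GET", PAGE_URL, False
    http.send
    html.body.innerHTML = http.responseText
    WriteUnicodeFile html.body.innerHTML, ThisWorkbook.Path & "\Xhr.txt"

    ' 2) Internet Explorer: page is loaded and its scripts are run
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.navigate PAGE_URL
    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
    Loop
    WriteUnicodeFile ie.document.body.innerHTML, ThisWorkbook.Path & "\Ie.txt"
    ie.Quit
End Sub
```

    With both files written the same way, differences between "Xhr.txt" and "Ie.txt" can be attributed to how the page was retrieved.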

    Please provide the differences if you want a deeper explanation.
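
    If it helps to locate where two outputs start to diverge, a small string helper along these lines could be used. FirstDifference is a hypothetical name, not part of any library, and the character-by-character loop is a simple sketch that will be slow on very large pages.

```vba
Option Explicit

' Hypothetical helper: returns the 1-based position of the first character
' at which the two strings differ, or 0 if the strings are identical.
Public Function FirstDifference(ByVal a As String, ByVal b As String) As Long
    Dim i As Long
    Dim n As Long

    ' Only compare up to the length of the shorter string
    n = IIf(Len(a) < Len(b), Len(a), Len(b))

    For i = 1 To n
        If Mid$(a, i, 1) <> Mid$(b, i, 1) Then
            FirstDifference = i
            Exit Function
        End If
    Next i

    ' Same prefix but different lengths: they differ just past the shorter one
    If Len(a) <> Len(b) Then FirstDifference = n + 1
End Function
```

    For example, FirstDifference("abc", "abd") returns 3 and FirstDifference("abc", "abc") returns 0. Passing the two innerHTML strings and then inspecting Mid$ around the returned position shows the first point of divergence.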


    Exploring page updating:

    1. Firefox Network Tab
    2. Internet Explorer Network Inspector
    3. Chrome Network Tab