Search code examples
vbaexceliframeweb-scrapinginternet-explorer-11

Unable to parse some links lying within an iframe


I've written a script in vba using IE to parse some links from a webpage. The thing is the links are within an iframe. I've twitched my code in such a way so that the script will first find a link within that iframe and navigate to that new page and parse the required content from there. If i do this way then I can get all the links.

Webpage URL: weblink

Successful approach (working one):

Sub Get_Links()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim elem As Object, post As Object

    With IE
        .Visible = True
        .navigate "put here the above link"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set elem = .document.getElementById("compInfo")   #it is within iframe
        .navigate elem.src
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set HTML = .document
    End With

    For Each post In HTML.getElementsByClassName("news")
        With post.getElementsByTagName("a")
         If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
        End With
    Next post
    IE.Quit
End Sub 

I've seen few sites where no such links exist within iframe so, I will have no option to use any link to track down the content.

If you take a look at the below approach by tracking the link then you can notice that I've parsed the content from a webpage which are within Iframe. There is no such link within Iframe to navigate to a new webpage to locate the content. So, I used contentWindow.document instead and found it working flawlessly.

Link to the working code of parsing Iframe content from another site: contentWindow approach

However, my question is: why should i navigate to a new webpage to collect the links as I can see the content in the landing page? I tried using contentWindow.document but it is giving me access denied error. How can I make my below code work using contentWindow.document like I did above?

I tried like this but it throws access denied error:

Sub Get_Links()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim frm As Object, post As Object

    With IE
        .Visible = True
        .Navigate "put here the above link"
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set HTML = .document
    End With

    ''the code breaks when it hits the following line "access denied error"

    Set frm = HTML.getElementById("compInfo").contentWindow.document

    For Each post In frm.getElementsByClassName("news")
        With post.getElementsByTagName("a")
         If .Length Then R = R + 1: Cells(R, 1) = .Item(0).href
        End With
    Next post
    IE.Quit
End Sub

I've attached an image to let you know which links (they are marked with pencil) I'm after.

These are the elements within which one such link (i would like to grab) is found:

<div class="news">
    <span class="news-date_time"><img src="images/arrow.png" alt="">19 Jan 2018 00:01</span>
    <a style="color:#5b5b5b;" href="/HomeFinancial.aspx?&amp;cocode=INE117A01022&amp;Cname=ABB-India-Ltd&amp;srno=17019039003&amp;opt=9">ABB India Limited - Press Release</a>
 </div>

Image of the links of that page I would like to grab:

enter image description here

From the very first day while creating this thread I strictly requested not to use this url http://hindubusiness.cmlinks.com/Companydetails.aspx?cocode=INE117A01022 to locate the data. I requested any solution from this main_page_link without touching the link within iframe. However, everyone is trying to provide solutions that I've already shown in my post. What did I put a bounty for then?


Solution

  • You can see the links within <iframe> in browser but can't access them programmatically due to Same-origin policy.

    There is the example showing how to retrieve the links using XHR and RegEx:

    Option Explicit
    
    Sub Test()
    
        Dim sContent As String
        Dim sUrl As String
        Dim aLinks() As String
        Dim i As Long
    
        ' Retrieve initial webpage HTML content via XHR
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", "https://www.thehindubusinessline.com/stocks/abb-india-ltd/overview/", False
            .Send
            sContent = .ResponseText
        End With
        'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
        ' Extract target iframe URL via RegEx
        With CreateObject("VBScript.RegExp")
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            ' Process all a within div.news
            .Pattern = "<iframe[\s\S]*?src=""([^""]*?Companydetails[^""]*)""[^>]*>"
            sUrl = .Execute(sContent).Item(i).SubMatches(0)
        End With
        ' Retrieve iframe HTML content via XHR
        With CreateObject("MSXML2.XMLHTTP")
            .Open "GET", sUrl, False
            .Send
            sContent = .ResponseText
        End With
        'WriteTextFile sContent, CreateObject("WScript.Shell").SpecialFolders("Desktop") & "\tmp\tmp.htm", -1
        ' Parse links via XHR
        With CreateObject("VBScript.RegExp")
            .Global = True
            .MultiLine = True
            .IgnoreCase = True
            ' Process all anchors within div.news
            .Pattern = "<div class=""news"">[\s\S]*?href=""([^""]*)"
            With .Execute(sContent)
                ReDim aLinks(0 To .Count - 1)
                For i = 0 To .Count - 1
                    aLinks(i) = .Item(i).SubMatches(0)
                Next
            End With
        End With
        Debug.Print Join(aLinks, vbCrLf)
    
    End Sub
    

    Generally RegEx's aren't recommended for HTML parsing, so there is disclaimer. Data being processed in this case is quite simple that is why it is parsed with RegEx.

    The output for me as follows:

    /HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17047038016&opt=9
    /HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17046039003&opt=9
    /HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17045039006&opt=9
    /HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043039002&opt=9
    /HomeFinancial.aspx?&cocode=INE117A01022&Cname=ABB-India-Ltd&srno=17043010019&opt=9
    

    I also tried to copy the content of the <iframe> from IE to clipboard (for further pasting to the worksheet) using commands:

    IE.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DODEFAULT
    IE.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT
    

    But actually that commands select and copy the main document, excluding the frame, unless I click on the frame manually. So that might be applied if click on the frame could be reproduced from VBA (frame node methods like .focus and .click didn't help).