ajax python-2.7 python-requests firebug

Python 2.7: Get the content of an iframe with requests


How do I get the content of this iframe with Python's requests?

In Firebug, the content of the iframe is in a response from a POST request which I am finding difficult to access using Python.

Code.

import requests

iframe_url = "https://wwwrs.resbank.co.za/BpsReports/InstNotice.aspx"

# How do I scrape the payload for the POST request?
r = requests.post(iframe_url) # data=?

r.text

The response.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head><title>Bank Institutions</title>...

The response I want.

1|#||4|224|updatePanel|ctl00_MainContent_ReportViewer1_DocMap|<div id="ctl00_MainContent_ReportViewer1_ctl09" style="display:none;">
<input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl09$ClientClickedId" id="ctl00_MainContent_ReportViewer1_ctl09_ClientClickedId" /></div>...

How do I scrape the data for the POST request?

EDIT.

I have added browser headers and a payload in the POST request. The response looks like the one I want (inasmuch as it is pipe-delimited) but it doesn't show the content of the iframe.

Additional code - Headers and payload.

headers = {
    'User-agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1', 
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}

payload = {
    "ctl00$MainContent$ScriptManager1": "ctl00$MainContent$ScriptManager1|ctl00$MainContent$ReportViewer1$ctl10$Reserved_AsyncLoadTarget",
    "__EVENTTARGET": "ctl00$MainContent$ReportViewer1$ctl10$Reserved_AsyncLoadTarget",
    "__EVENTARGUMENT": "",
    "__VIEWSTATE": "/wEPDwUKLTYyOTE1Njg2OQ9kFgJmD2QWA...",  # shortened
    "__VIEWSTATEGENERATOR": "E6F8CED1",
    "__EVENTVALIDATION": "/wEdABKvYwmAyl699cXAjRyhDjs8urvg...",  # shortened
    "ctl00$MainContent$ReportViewer1$ctl03$ctl11":"ltr",
    "ctl00$MainContent$ReportViewer1$ctl03$ctl12":"standards",
    #"ctl00$MainContent$ReportViewer1$AsyncWait$HiddenCancelField":"False",
    "ctl00$MainContent$ReportViewer1$ctlToggleParam$store":"false",
    #"ctl00$MainContent$ReportViewer1$ctl08$collapse":"false",
    "ctl00$MainContent$ReportViewer1$ctl10$VisibilityState$ctl00":None,
    #"ctl00$MainContent$ReportViewer1$ctl10$ReportControl$ctl04":100,
    "__ASYNCPOST": "true"
}

r = requests.post(iframe_url, headers=headers, data=payload, timeout=3.5)

Several keys had no value, so I omitted them. A few key-value pairs caused a 500 server error, so I commented them out.

The response in Python does not include the iframe.

1|#||4|7645|updatePanel|ctl00_MainContent_ReportViewer1_ReportViewer|
  <div id="ctl00_MainContent_ReportViewer1" 
  onclick="if 
   ($get('ctl00_MainContent_ReportViewer1_ctl04') != null && 
    $get('ctl00_MainContent_ReportViewer1_ctl04').control != null) 
    $get('ctl00_MainContent_ReportViewer1_ctl04').control.HideActiveDropDown();" 
  onactivate="if
   ($get('ctl00_MainContent_ReportViewer1_ctl04') != null && 
    $get('ctl00_MainContent_ReportViewer1_ctl04').control != null) 
    $get('ctl00_MainContent_ReportViewer1_ctl04').control.HideActiveDropDown();" style="height:400px;width:400px;position:absolute;left:0px;Top:0px">
  <div id="ctl00_MainContent_ReportViewer1_HttpHandlerMissingErrorMessage" style="border-color:Red;border-width:2px;border-style:Solid;padding:10px;display:none;overflow:auto;font-size:.85em;">
   <h2>
      Report Viewer Configuration Error
   </h2><p>The Report Viewer Web Control HTTP Handler has not been registered in the application's web.config file.  
            Add <add verb="*" path="Reserved.ReportViewerWebControl.axd" 
            type = "Microsoft.Reporting.WebForms.HttpHandler, Microsoft.ReportViewer.WebForms, 
            Version=10.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" /> to the system.web/httpHandlers section of the web.config file, or add <add 
            name="ReportViewerWebControlHandler" preCondition="integratedMode" verb="*" path="Reserved.ReportViewerWebControl.axd"
            type="Microsoft.Reporting.WebForms.HttpHandler, Microsoft.ReportViewer.WebForms, 
            Version=10.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" /> to the system.webServer/handlers section for Internet Information Services 7 or later.
        </p>
  </div>
  <span id="ctl00_MainContent_ReportViewer1_ctl03">
    <input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl03$ctl00" id="ctl00_MainContent_ReportViewer1_ctl03_ctl00" />
    <input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl03$ctl01" id="ctl00_MainContent_ReportViewer1_ctl03_ctl01" />
  </span>
    <input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl11" id="ctl00_MainContent_ReportViewer1_ctl11" />
    <input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl12" id="ctl00_MainContent_ReportViewer1_ctl12" />
      <div id="ctl00_MainContent_ReportViewer1_AsyncWait" style="background-color:White;opacity:0.7;position:absolute;display:none;filter:alpha(opacity=70);">
      </div>
      <div id="ctl00_MainContent_ReportViewer1_AsyncWait_Wait" style="cursor:wait;background-color:#ECE9D8;padding:15px;border:1px solid black;display:none;position:absolute;">
        <table height="100%">
          <tr>
            <td width="32px" height="32px">
              <img src="/BpsReports/Reserved.ReportViewerWebControl.axd?OpType=Resource&Version=10.0.30319.1&Name=Microsoft.Reporting.WebForms.Icons.SpinningWheel.gif" style="height:32px;width:32px;" />
            </td>
            <td style="vertical-align:middle;text-align:center;">
              <span style="font-family:Verdana;font-size:14pt;">Loading...</span>
                <div style="margin-top:3px;">
                  <a href="javascript:$get('ctl00_MainContent_ReportViewer1_AsyncWait').control._cancelCurrentPostback();" style="font-family:Verdana;font-size:8pt;color:#3366CC;">Cancel</a>...

The response in Firebug includes the iframe.

1|#||4|224|updatePanel|ctl00_MainContent_ReportViewer1_DocMap|<div id="ctl00_MainContent_ReportViewer1_ctl09"
 style="display:none;">
    <input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl09$ClientClickedId" id="ctl00_MainContent_ReportViewer1_ctl09_ClientClickedId"
 />
</div>|5472|updatePanel|ctl00_MainContent_ReportViewer1_ctl10_ReportArea|<div NewContentType="Microsoft
.Reporting.WebFormsClient.ReportAreaContent.ReportPage" ForNonReportContentArea="false" id="ctl00_MainContent_ReportViewer1_ctl10_VisibilityState"
 style="visibility:none;">
    <input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl10$VisibilityState$ctl00" value="ReportPage"
 />
</div><input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl10$ScrollPosition" id="ctl00_MainContent_ReportViewer1_ctl10_ScrollPosition"
 /><span id="ctl00_MainContent_ReportViewer1_ctl10_Reserved_AsyncLoadTarget"></span><div id="ctl00_MainContent_ReportViewer1_ctl10_ReportControl"
 style="display:none;">
    <span></span><input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl10$ReportControl$ctl02" id
="ctl00_MainContent_ReportViewer1_ctl10_ReportControl_ctl02" /><input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl10$ReportControl$ctl03"
 id="ctl00_MainContent_ReportViewer1_ctl10_ReportControl_ctl03" /><input type="hidden" name="ctl00$MainContent$ReportViewer1$ctl10$ReportControl$ctl04"
 id="ctl00_MainContent_ReportViewer1_ctl10_ReportControl_ctl04" value="100" /><div style="display:none
;">
        <DIV dir=...The content of the iframe.

Questions which may aid a solution.

  1. How do I scrape the payload for the POST request?

  2. Is the sequence of requests preparing the payload?

  3. How do I get the iframe with requests?

Thanks for reading.


Solution

  • The content of the iframe is available if the cookies are scraped and then added to the headers.

    headers = {'User-agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}
    
    r = requests.get(url)
    cookie = {'Cookie': r.headers['Set-Cookie']}
    headers.update(cookie)
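
A `Set-Cookie` value typically carries attributes such as `path=/` or `HttpOnly` alongside the cookie itself, and the snippet above forwards all of it. The sketch below keeps only the `name=value` pairs. It assumes Python 3 (on Python 2.7 the module is `Cookie` rather than `http.cookies`) and a single `Set-Cookie` header. The helper name `cookie_header_from_set_cookie` and the sample value are made up for illustration.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_value):
    """Reduce a Set-Cookie value to the 'name=value' pairs a Cookie header expects."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    # Attributes such as Path or HttpOnly are stored on each morsel, not as
    # separate keys, so iterating the jar yields only the real cookies.
    return "; ".join(
        "{}={}".format(name, morsel.value) for name, morsel in jar.items()
    )


# Placeholder value for illustration; a real one comes from r.headers['Set-Cookie'].
example = "ASP.NET_SessionId=abc123; path=/; HttpOnly"
print(cookie_header_from_set_cookie(example))  # ASP.NET_SessionId=abc123
```

Alternatively, a `requests.Session()` object stores cookies from each response and sends them on later requests made through the same session, so the header may not need to be built by hand at all.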
    

    The payload can be shortened by removing keys that have no value, e.g. __EVENTARGUMENT.

    payload = {
        "__EVENTTARGET": "ctl00$MainContent$ReportViewer1$ctl10$Reserved_AsyncLoadTarget",
        "__VIEWSTATE": "/wEPDwUKLTYyOTE1Njg2OQ9kFgJmD2...", # shortened
        "__EVENTVALIDATION": "/wEdABKvYwmAyl699cXAjRyh...", # shortened
        "__ASYNCPOST": "true"
    }
    

    The content of the iframe (URLs) can be stored in a list.

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(r.text)
    links = soup.table.find_all('a')
    hrefs = [ i['href'] for i in links ]
    
    >>>['https://wwwrs.resbank.co.za/BpsReports/bankContactRprtPage.aspx?Code=BL', 'https://wwwrs.resbank.co.za/BpsReports/bankContactRprtPage.aspx?Code=BR', 'https://wwwrs.resbank.co.za/BpsReports/bankContactRprtPage.aspx?Code=R', 'https://wwwrs.resbank.co.za/BpsReports/bankContactRprtPage.aspx?Code=FB', 'https://wwwrs.resbank.co.za/BpsReports/bankContactRprtPage.aspx?Code=LB', 'https://wwwrs.resbank.co.za/BpsReports/bankContactRprtPage.aspx?Code=MB']
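
The pipe-delimited body returned by the async POST appears to follow the ASP.NET AJAX partial-rendering layout of `length|type|id|content|` records. The `1|#||4|` and `224|updatePanel|...` prefixes in the question fit this pattern. The sketch below splits such a body into records, assuming that layout and assuming each length counts characters. `parse_delta` and the sample string are hypothetical and used only for illustration.

```python
def parse_delta(text):
    """Split a 'length|type|id|content|' body into (type, id, content) tuples."""
    records = []
    pos = 0
    while pos < len(text):
        # Each record starts with "<length>|<type>|<id>|".
        length_end = text.index("|", pos)
        length = int(text[pos:length_end])
        type_end = text.index("|", length_end + 1)
        id_end = text.index("|", type_end + 1)
        # The content is read by length, so it may itself contain "|".
        content_start = id_end + 1
        content = text[content_start:content_start + length]
        records.append(
            (text[length_end + 1:type_end], text[type_end + 1:id_end], content)
        )
        # Skip the content and its trailing "|" delimiter.
        pos = content_start + length + 1
    return records


# Made-up sample in the same shape as the response shown in the question.
sample = "1|#||4|14|updatePanel|panel1|<div>a|b</div>|"
print(parse_delta(sample))
# [('#', '', '4'), ('updatePanel', 'panel1', '<div>a|b</div>')]
```

The content of an `updatePanel` record is an HTML fragment, so it could be handed to BeautifulSoup on its own rather than parsing the whole response. A body that does not start with a length (for example a full HTML error page) raises `ValueError` here.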
    

    Update

    This answer failed after a short period of time. I think this is because the parameters __VIEWSTATE and __EVENTVALIDATION both change in every session, so they need to be scraped rather than hard-coded into the program.

    Updated solution

    Since some of the variables expire, the updated solution below scrapes new parameters on each run. These are: Cookie, __VIEWSTATE and __EVENTVALIDATION. This is now a working solution.

    import requests
    from bs4 import BeautifulSoup
    import re
    
    url = "https://wwwrs.resbank.co.za/BpsReports/InstNotice.aspx"
    
    r = requests.get(url)
    
    headers = {'User-agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) '\
        + 'Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}
    
    cookie = {'Cookie': r.headers['Set-Cookie']}
    headers.update(cookie)
    
    payload = {
        '__EVENTTARGET': 'ctl00$MainContent$ReportViewer1$ctl10$Reserved_AsyncLoadTarget',
        '__ASYNCPOST': 'true'
    }
    
    # Non-greedy match so the capture stops at the first closing quote.
    view_state = re.search(r'__VIEWSTATE"\svalue="(.*?)"', r.text)
    payload['__VIEWSTATE'] = str(view_state.group(1))
    # The requests docs advise str over unicode for header values.
    
    event_validation = re.search(r'__EVENTVALIDATION"\svalue="(.*?)"', r.text)
    payload['__EVENTVALIDATION'] = str(event_validation.group(1))
    
    r = requests.post(url, headers=headers, data=payload, timeout=3.5)
    
    soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly
    links = soup.table.find_all('a')
    hrefs = [ i['href'] for i in links ]
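
The regular expressions above depend on the exact attribute order in the page source. As an alternative sketch, the hidden form fields can be collected with the standard library's HTML parser, which does not depend on attribute order. This assumes Python 3 (`html.parser`) and that the page exposes `__VIEWSTATE` and `__EVENTVALIDATION` as hidden `<input>` elements. `HiddenFieldParser`, `hidden_fields` and the sample page are hypothetical names and placeholder values.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attributes.get("name"):
            # A hidden input without a value attribute is stored as "".
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Placeholder page; the real input would be the text of the initial GET response.
sample_page = """
<form>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUK" />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEdABKv" />
<input type="text" name="q" value="ignored" />
</form>
"""
fields = hidden_fields(sample_page)
print(sorted(fields))  # ['__EVENTVALIDATION', '__VIEWSTATE']
```

The resulting dictionary could be merged into the payload with `payload.update(fields)` before adding `__EVENTTARGET` and `__ASYNCPOST`. Whether the server accepts every hidden field is not confirmed here, since the question reports that some keys caused a 500 error.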