python web-scraping spynner

Download a file over an HTTPS query with a Python headless browser


I am scraping a website in Python (using spynner and BeautifulSoup). At some point I want to test a zip file download, which is triggered by the following HTTP query:

https://mywebsite.com/download?from=2011&to=2012

Opened directly in a browser (Chrome), this URL triggers the download of a zip file with a given name. I have not been able to reproduce this behavior with my headless browser. I know it is not the right way to do it, but using something like this with spynner:

from spynner import Browser
b = Browser()
b.load(webpage,wait_callback=wait_page_load, tries=3)
b.load_jquery(True)
...
output = b.load("https://mywebsite.com/download?from=2011&to=2012")
print b.html
>> ...

does not work, of course (no zip file is downloaded). The last print statement shows that I end up on an error page with a Java exception stack trace.

Is there a way to

  1. properly issue the HTTP query without using the spynner load mechanism?
  2. capture the resulting zip file?
  3. download it with a chosen name?

Thanks for your help.

One last thing, found after some testing in Chrome with the JavaScript debugger: I get the following warning when doing it in the browser:

Resource interpreted as Document but transferred with MIME type application/zip "https://mywebsite.com/download?from=2011&to=2012"

Edited:

I found out that the call actually made was:

https://mywebsite.com/download?from=10%2F18%2F2011&to=10%2F18%2F2012

This percent-encoded form works in a browser. In Python it should be replaced by the unencoded form

https://mywebsite.com/download?from=10/18/2011&to=10/18/2012

because the encoded form could not be used there: the URL encoding step would encode it a second time, mapping %2F to %252F.
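
The double-encoding effect can be reproduced with the standard library alone. This is only a sketch, not what spynner or Qt does internally: it assumes Python 3's urllib.parse (in Python 2 the same functions live in urllib), and BASE_URL and params are illustrative names built from the URLs above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://mywebsite.com/download"

# Raw (unencoded) parameter values, as in the second URL above.
params = {"from": "10/18/2011", "to": "10/18/2012"}

# Encoding the raw values once yields the query string the browser sent.
encoded_once = urlencode(params)

# Encoding an already-encoded value again escapes the "%" itself as "%25".
encoded_twice = quote("10%2F18%2F2011", safe="")

download_url = BASE_URL + "?" + encoded_once
print(download_url)   # https://mywebsite.com/download?from=10%2F18%2F2011&to=10%2F18%2F2012
print(encoded_twice)  # 10%252F18%252F2011
```

Starting from the raw values and encoding exactly once avoids the %252F form.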


Solution

  • I'm not sure if this will handle your case, but give it a try:

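    import spynner
    from PyQt4.QtCore import QByteArray

    # `b` is the spynner Browser from the question; `url` is the download URL.

    def download_finished(reply):
        # Save the body of the finished reply under a name of your choice.
        try:
            with open('filename.ext', 'wb') as downloaded_file:
                downloaded_file.write(reply.readAll())
        except Exception as error:
            # Report the failure instead of silently ignoring it.
            print('download failed: %s' % error)

        # Stop listening once this reply has been handled.
        b.manager.finished.disconnect(download_finished)

    download_url = spynner.QUrl(url)
    request = spynner.QNetworkRequest(download_url)

    # Send a browser-like Accept header with the request.
    request.setRawHeader('Accept', QByteArray(
        'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'))

    # Register the handler before issuing the request, then wait for it.
    b.manager.finished.connect(download_finished)
    reply = b.manager.get(request)
    b.wait_requests(1)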
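
    Since the question mentions ending up on an error page, it may be worth checking that the received bytes really are a zip archive before saving them under the chosen name. The following is a minimal sketch using only the standard library (written for Python 3); save_zip_payload and the file names are hypothetical, and payload stands for the raw bytes read from the reply.

```python
import io
import os
import tempfile
import zipfile


def save_zip_payload(payload, filename):
    """Write payload to filename only if it is a readable zip archive."""
    if not zipfile.is_zipfile(io.BytesIO(payload)):
        # Probably an HTML error page rather than the archive.
        raise ValueError("response body is not a zip archive")
    with open(filename, "wb") as downloaded_file:
        downloaded_file.write(payload)
    return filename


# Demo: an in-memory archive stands in for the downloaded bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "year,value\n2011,1\n")

target = os.path.join(tempfile.mkdtemp(), "export_2011_2012.zip")
saved_path = save_zip_payload(buffer.getvalue(), target)
with zipfile.ZipFile(saved_path) as saved_archive:
    print(saved_archive.namelist())

# An error page is rejected instead of being saved as a .zip file.
try:
    save_zip_payload(b"<html>error page</html>", target)
except ValueError as error:
    print("rejected:", error)
```

    Raising on a non-zip body makes a failed request visible, rather than leaving an HTML file with a .zip extension on disk.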