I am trying to scrape a website in Python (using spynner and BeautifulSoup). At some point I want to test a zip file download, triggered by the following HTTP request:
https://mywebsite.com/download?from=2011&to=2012
When opened directly in a browser (Chrome), this triggers the download of a zip file with a given name. I have not been able to reproduce this behavior with my headless browser. I know it's not the right way to do it, but something like this with spynner:
from spynner import Browser
b = Browser()
b.load(webpage,wait_callback=wait_page_load, tries=3)
b.load_jquery(True)
...
output = b.load("https://mywebsite.com/download?from=2011&to=2012")
print b.html
>> ...
does not work, of course (no zip file is downloaded). The last print statement shows that I end up on an error page with a Java exception stack trace.
Is there a way to
Thanks for your help.
One last thing that came up after some testing in Chrome with the JavaScript debugger: I get the following warning when loading the URL in the browser:
Resource interpreted as Document but transferred with MIME type application/zip "https://mywebsite.com/download?from=2011&to=2012"
Edited:
Found out that the call made was:
https://mywebsite.com/download?from=10%2F18%2F2011&to=10%2F18%2F2012
which works in a browser but, in Python, has to be replaced by
https://mywebsite.com/download?from=10/18/2011&to=10/18/2012
because the pre-encoded form could not be used in Python: the URL encoding step would turn %2F
into %252F
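To illustrate the double encoding described above, here is a minimal sketch using only the standard library. It is not what spynner does internally (that is an assumption I have not checked); it only shows how percent-encoding an already encoded value produces %252F, while encoding the raw value produces %2F. The variable names are made up for the example.

```python
try:
    from urllib.parse import quote, urlencode  # Python 3
except ImportError:
    from urllib import quote, urlencode  # Python 2

already_encoded = "10%2F18%2F2011"  # value copied from the browser request
raw = "10/18/2011"                  # the same value before any encoding

# Encoding a value that is already encoded escapes the '%' itself as %25.
double_encoded = quote(already_encoded, safe="")

# Encoding the raw value escapes '/' exactly once.
single_encoded = quote(raw, safe="")

# Building the query string from raw values gives the form the browser sent.
query = urlencode([("from", "10/18/2011"), ("to", "10/18/2012")])
```

Here `double_encoded` contains %252F and `single_encoded` contains %2F, which is why the raw dates should be handed to whatever component performs the encoding.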
I'm not sure if this will handle your case, but give it a try:
def download_finished(reply):
    try:
        with open('filename.ext', 'wb') as downloaded_file:
            downloaded_file.write(reply.readAll())
    except Exception:
        pass
    b.manager.finished.disconnect(download_finished)

download_url = spynner.QUrl(url)
request = spynner.QNetworkRequest(download_url)
# requires: from PyQt4.QtCore import QByteArray
request.setRawHeader('Accept', QByteArray(
    'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'))
b.manager.finished.connect(download_finished)
reply = b.manager.get(request)
b.wait_requests(1)
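Since the question mentions ending up on an error page instead of the archive, it may help to check what was actually received. The following is a rough sketch using only the standard library; `looks_like_zip` is a hypothetical helper name, and the two payloads are fabricated stand-ins for a real response, not output from the site.

```python
import io
import zipfile


def looks_like_zip(payload):
    """Return True if the bytes in `payload` form a readable zip archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


# Build a tiny archive in memory to stand in for a successful download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "hello")
zip_payload = buffer.getvalue()

# Stand-in for the HTML error page described in the question.
error_payload = b"<html><body>server error page</body></html>"

zip_ok = looks_like_zip(zip_payload)
error_ok = looks_like_zip(error_payload)
```

Running the same check on the bytes written to the file would tell you whether the server sent the zip or an error page.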