I'm trying to download PDF files from a site that requires a username and password. I can get MechanicalSoup to submit my login credentials, and when I navigate to the PDF I can view it fine with MechanicalSoup's launch_browser(), but I cannot download it. In the past (when using BeautifulSoup in Python 2 on a site that didn't require authentication) I would just pass the URL to urllib2 like so:
page = urllib2.urlopen(download_url)
file = open(fileName, 'wb')  # binary mode, since a PDF is not text
file.write(page.read())
file.close()
When I do the same with urllib.request, I find that I have to enter my login credentials again. So I tried this (following directions here):
import urllib.request

loginUrl = "http://..."
urlToPDF = "http://..."
passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, loginUrl, "username", "password")
authhandler = urllib.request.HTTPBasicAuthHandler(passman)
opener = urllib.request.build_opener(authhandler)
urllib.request.install_opener(opener)
page = urllib.request.urlopen(urlToPDF)
file = open("test.pdf", "wb")
file.write(page.read())
file.close()
However, this doesn't seem to work. When I change the filename to "test" (dropping the ".pdf") and open it in Chrome, it turns out the file contains the web page telling me to click a button that redirects to the login page. That is why I tried using both loginUrl and urlToPDF in the code above.
The forums I've read so far seem to say the above should work. Alternatively, since I can view the PDF using MechanicalSoup, is there a way to download a PDF directly with MechanicalSoup?
You can certainly download the PDF using MechanicalSoup.
The return value of many of the StatefulBrowser methods (including StatefulBrowser.open and StatefulBrowser.follow_link) is a requests.Response object. If the request is successful, the data you want is stored in the Response.content attribute. So downloading the file amounts to writing this attribute to a file!
Here is an example:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
response = browser.open("http://example.com/example.pdf")
with open('your_filename_here.pdf', 'wb') as f:
    f.write(response.content)
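If the session is not actually logged in, the server may send back an HTML login page instead of the PDF, which matches the symptom described in the question. Here is a minimal sketch of a guard against that, assuming the response behaves like a requests.Response (status_code and content attributes). The helper name save_pdf is hypothetical and not part of MechanicalSoup, and the signature check is only a heuristic:

```python
PDF_MAGIC = b"%PDF-"  # PDF files carry this signature at (or near) the start


def save_pdf(response, path):
    """Write response.content to path, but only if it looks like a PDF.

    `response` is assumed to expose `status_code` and `content`, as
    requests.Response does. Hypothetical helper, not a MechanicalSoup API.
    Returns the number of bytes written.
    """
    if response.status_code != 200:
        raise ValueError("unexpected HTTP status: %s" % response.status_code)
    # Heuristic: look for the signature within the first 1024 bytes.
    if PDF_MAGIC not in response.content[:1024]:
        # Most likely an HTML page (e.g. a login prompt) rather than the file.
        raise ValueError("response body does not look like a PDF")
    with open(path, "wb") as f:
        f.write(response.content)
    return len(response.content)
```

With the browser from the example, this would be called as save_pdf(response, 'your_filename_here.pdf') after logging in and opening the PDF URL.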
In the future, I expect that MechanicalSoup will implement this more directly as a StatefulBrowser.download method (or something along those lines). See this issue on the MechanicalSoup GitHub page to follow the development of this feature.
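For large files, holding the whole body in Response.content may be undesirable. Below is a rough sketch of a streamed download, assuming `session` is a requests.Session such as the one MechanicalSoup keeps as browser.session (so the login cookies are reused). The name stream_to_file is hypothetical, not a MechanicalSoup API:

```python
def stream_to_file(session, url, path, chunk_size=8192):
    """Download `url` through `session` in chunks and write it to `path`.

    `session` is assumed to be a requests.Session (for example
    browser.session after logging in). Hypothetical helper.
    Returns the number of bytes written.
    """
    response = session.get(url, stream=True)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            written += len(chunk)
    return written
```

A call would look like stream_to_file(browser.session, "http://example.com/example.pdf", "your_filename_here.pdf").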