Tags: python, python-3.x, pdf-generation, urllib, mechanicalsoup

Issue Downloading PDF with urllib on Website Requiring Authentication


I'm trying to download PDF files from a site that requires a username and password. MechanicalSoup enters my login credentials successfully, and when I navigate to the PDF I can view it fine with MechanicalSoup's launch_browser(), but I cannot download it. In the past (using BeautifulSoup in Python 2 on a site that didn't require authentication) I would just pass the URL to urllib2 like so:

import urllib2

page = urllib2.urlopen(download_url)
# Open in binary mode so the PDF bytes are written unchanged
with open(fileName, 'wb') as f:
    f.write(page.read())

When I do the same with urllib.request, I find that I have to enter my login credentials again. So I tried (following directions here):

import urllib.request

loginUrl = "http://..."
urlToPDF = "http://..."

# Register the credentials for HTTP Basic authentication
passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, loginUrl, "username", "password")

authhandler = urllib.request.HTTPBasicAuthHandler(passman)
opener = urllib.request.build_opener(authhandler)

# Route every later urlopen() call through this opener
urllib.request.install_opener(opener)

page = urllib.request.urlopen(urlToPDF)
with open("test.pdf", "wb") as f:
    f.write(page.read())

However, this doesn't work. When I rename the file to "test" (dropping the ".pdf") and open it in Chrome, I see that what was written is the web page telling me to click a button that redirects to the login page. That is why I tried using both loginUrl and urlToPDF in the code above.

The forum posts I've read so far suggest the above should work. Alternatively, since I can view the PDF using MechanicalSoup, is there a way to download it directly with MechanicalSoup?


Solution

  • You can certainly download the PDF using MechanicalSoup.

    The return value of many StatefulBrowser methods (including StatefulBrowser.open and StatefulBrowser.follow_link) is a requests.Response object. If the request is successful, the data you want is stored in the Response.content attribute, so downloading the file amounts to writing this attribute to a file.

    Here is an example:

    import mechanicalsoup
    
    browser = mechanicalsoup.StatefulBrowser()
    response = browser.open("http://example.com/example.pdf")
    
    with open('your_filename_here.pdf', 'wb') as f:
        f.write(response.content)
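
The example above fetches a public URL. For a site behind a login form, as in the question, the login and the download need to go through the same browser object so that the session cookies carry over. This is likely also why the HTTPBasicAuthHandler attempt failed: that handler only answers HTTP 401 challenges, not HTML login forms. Below is a minimal sketch, assuming the first form on the login page has fields named username and password (hypothetical names; inspect the real form). fetch_after_login is an illustrative helper, not part of MechanicalSoup.

```python
def fetch_after_login(browser, login_url, pdf_url, username, password):
    """Log in with `browser`, then request `pdf_url` in the same session.

    `browser` is expected to behave like mechanicalsoup.StatefulBrowser.
    """
    browser.open(login_url)
    # Assumption: the first <form> on the page is the login form.
    browser.select_form("form")
    # Assumption: the inputs are named "username" and "password".
    browser["username"] = username
    browser["password"] = password
    login_response = browser.submit_selected()
    login_response.raise_for_status()
    # Cookies set during login are stored on the browser's session,
    # so this request is sent as the logged-in user.
    return browser.open(pdf_url)
```

With a real browser this would be called as fetch_after_login(mechanicalsoup.StatefulBrowser(), loginUrl, urlToPDF, "username", "password"), and the content of the returned response written to disk as shown earlier.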
    

    In the future, I expect that MechanicalSoup will implement this more directly as a StatefulBrowser.download method (or something along those lines). See this issue on the MechanicalSoup GitHub page to follow the development of this feature.
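
A helper along those lines can also be sketched by hand. The version below is only an illustration (save_pdf is a hypothetical name, not a MechanicalSoup method). It assumes the response object offers raise_for_status() and content, as requests.Response does, and it refuses to write anything that does not start with the %PDF marker that PDF files normally begin with. That check guards against the symptom described in the question, where an HTML login page was silently saved as test.pdf.

```python
def save_pdf(response, path):
    """Write response.content to `path`, refusing bodies that are not PDFs.

    Returns the number of bytes written.
    """
    # Raise on 4xx/5xx instead of saving an error page.
    response.raise_for_status()
    data = response.content
    # PDF files normally start with the bytes "%PDF"; an HTML login
    # page would fail this check.
    if not data.startswith(b"%PDF"):
        raise ValueError("response body is not a PDF (possibly a login page)")
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

For example, save_pdf(browser.open(urlToPDF), "test.pdf") would either write the file or raise an error that points to a failed login.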