python beautifulsoup urllib3 urlopen

How can I access a PDF file with Python through an automatic download link?


I am trying to write a Python script that opens a press-release page (the URL is in the code below), finds the link at the bottom of the body text (anchor text "here"), and downloads the PDF that loads after clicking that link. I can retrieve the HTML of the original page and find the download link, but I don't know how to get from there to the PDF itself. Any help would be much appreciated. Here's what I have so far:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open page and locate href for bill text
url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a', href=True, string=['HERE', 'here', 'Here']):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]

# Open download link to get PDF
html = urlopen(links2[0])
soup = BeautifulSoup(html, 'html.parser')
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
links2 = [x for x in links if x is not None]

At this point, the list of links I get does not include the PDF I am looking for. Is there any way to get it without hardcoding the PDF link in the code (that would defeat the purpose of what I am trying to do here)? Thanks!


Solution

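  • Look for the a element whose text is "here", then follow the trail: request its href, read the PDF URL from the refresh header of that response, and request that URL.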

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.murphy.senate.gov/newsroom/press-releases/murphy-blumenthal-introduce-legislation-to-create-a-national-green-bank-thousands-of-clean-energy-jobs'
    
    # Send a browser-like User-Agent header with every request.
    user_agent = {'User-agent': 'Mozilla/5.0'}
    
    # A session keeps cookies across the three requests.
    s = requests.Session()
    
    r = s.get(url, headers=user_agent)
    soup = BeautifulSoup(r.content, 'html.parser')
    for a in soup.select('a[href]'):
        if a.text.strip().lower() == 'here':
            # Step 1: follow the "here" link to the intermediate download page.
            r = s.get(a['href'], headers=user_agent)
            print(r.status_code, r.reason)
            print(r.headers)
            # Step 2: the response carries a refresh header; the PDF URL follows "url=".
            _, dl_url = r.headers['refresh'].split('url=', 1)
            # Step 3: request the PDF itself.
            r = s.get(dl_url, headers=user_agent)
            print(r.status_code, r.reason)
            print(r.headers)
            file_bytes = r.content  # here's your PDF; you can write it out to a file
            break  # stop after the first matching link
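
The two fragile steps above are pulling the URL out of the refresh header and writing the bytes to disk. Here is a minimal sketch of both, under some assumptions: the helper names parse_refresh_url and save_pdf are hypothetical (not part of requests or BeautifulSoup), the header value is assumed to look like "0; url=...", and the PDF check only looks at the leading %PDF signature.

```python
import re
from pathlib import Path
from urllib.parse import urljoin


def parse_refresh_url(refresh_value, base_url):
    """Return the absolute target URL from a Refresh header value, or None.

    Hypothetical helper. Assumes a value such as
    "0; url=https://example.com/file.pdf".
    """
    # Match "url=" case-insensitively and capture everything after it.
    match = re.search(r'url\s*=\s*(.+)', refresh_value, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop surrounding whitespace and optional quotes around the target.
    target = match.group(1).strip().strip('\'"')
    # Resolve a relative target against the page that sent the header.
    return urljoin(base_url, target)


def save_pdf(file_bytes, path):
    """Write file_bytes to path, refusing data that does not look like a PDF.

    Hypothetical helper. Only the leading "%PDF" signature is checked.
    """
    if not file_bytes.startswith(b'%PDF'):
        raise ValueError('response body does not start with the PDF signature')
    out = Path(path)
    out.write_bytes(file_bytes)
    return out
```

In the loop above, these could replace the split and the final comment, for example dl_url = parse_refresh_url(r.headers['refresh'], r.url) followed by save_pdf(r.content, 'bill.pdf') after the last request, where 'bill.pdf' is just a placeholder filename. Checking for None before requesting dl_url avoids a confusing error if the site ever stops sending the refresh header.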