python, pdf, web-scraping, request, cdn

requests.get() file type issues - Can't get PDF from Content Delivery Network


I'm having trouble getting the contents of PDFs, found here, because they are hosted by a Content Delivery Network (CDN) called Widen.

The code below works for PDF links embedded directly within the webpage...

import os
import requests

url = 'https://embed.widencdn.net/pdf/plus/widnr/kdlgedcepu/miss_surv_120117.pdf?u=7vkdxn'

filepath = r"C:\Users\pathgoeshere\{}.pdf".format('test')
if os.path.exists(filepath):
    pass
else:
    r = requests.get(url)
    with open(filepath, 'wb') as f:
        f.write(r.content)

... but since the URL points to the content delivery network and not to the PDF itself, the request does not return the desired PDF; opening the saved file throws an error.

Can anyone lend a hand in scraping pdf files hosted via a content delivery network?


Solution

  • The issue is that you do not get a PDF from the CDN because it wraps the PDF in a script that automatically sets a password and redirects your request to another URL. To download the PDF, you first have to extract the script tag from the page head to find the URL that points to the PDF. Then you have to build a second request with exactly the same parameters the script is setting:

    1. Signature
    2. Expires
    3. Key-Pair-Id

    The second request then downloads the PDF.

    import os
    import requests
    import urllib.parse as urlparse
    
    from urllib.parse import parse_qs
    from urlextract import URLExtract
    
    from bs4 import BeautifulSoup
    
    url = 'https://embed.widencdn.net/pdf/plus/widnr/rfazsshahb/Fall2017Waterfowl_GreenBay_Survey_Nov.pdf?u=7vkdxn'
    
    filepath = r'C:\Path\{}.pdf'.format('test')
    if os.path.exists(filepath):
        pass
    else:
        request = requests.get(url)
        html = BeautifulSoup(request.content, 'html.parser')
        pdf_script = html.head.find('script', type="text/javascript").string
    
        # Extract the url
        extractor = URLExtract()
        url_to_pdf = extractor.find_urls(pdf_script)
    
        # Parse URL
        parsed = urlparse.urlparse(url_to_pdf[0])
    
        # Get parameters
        signature = parse_qs(parsed.query)['Signature'][0]
        expires = int(parse_qs(parsed.query)['Expires'][0])
        kip = parse_qs(parsed.query)['Key-Pair-Id'][0]
    
        url = parsed.scheme + "://" + parsed.netloc + parsed.path
    
        # Build the second request
        pdf_request = requests.get(url, params={'Key-Pair-Id': kip, 'Signature': signature, 'Expires': expires})
        print(pdf_request)
        with open(filepath, 'wb') as f:
            f.write(pdf_request.content)
    
    

    You may need to install urlextract and BeautifulSoup:

    pip install beautifulsoup4
    pip install urlextract
    

    Note that this is not a general solution and may only work with this CDN.
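    One safeguard that does generalize: every PDF starts with the magic bytes `%PDF-`, so you can check `r.content` (or the saved file) before trusting the download. A minimal sketch (the helper name is my own):

    ```python
    def is_pdf(content: bytes) -> bool:
        """Return True if the bytes look like a PDF (PDF files begin with %PDF-)."""
        return content[:5] == b'%PDF-'

    # Simulated payloads; in practice pass r.content from the requests call.
    print(is_pdf(b'<html><body>CDN wrapper page</body></html>'))  # False
    print(is_pdf(b'%PDF-1.4\n...'))                               # True
    ```

    Checking `r.headers.get('Content-Type') == 'application/pdf'` is a complementary test, though some servers mislabel it, so the magic-bytes check is the more reliable of the two.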