python, pdf, web-scraping, request, cdn

requests.get() file type issues - Can't get PDF from Content Delivery Network


I'm having trouble getting the contents of PDFs, found here, because they are hosted by a Content Delivery Network (CDN) called Widen.

The code below works for PDF links embedded directly within the webpage...

import os
import requests

url = 'https://embed.widencdn.net/pdf/plus/widnr/kdlgedcepu/miss_surv_120117.pdf?u=7vkdxn'

filepath = r"C:\Users\pathgoeshere\{}.pdf".format('test')
if os.path.exists(filepath):
    pass
else:
    r = requests.get(url)
    with open(filepath, 'wb') as f:
        f.write(r.content)

... but since the URL points to the content delivery network and not to the PDF itself, the request does not return the desired PDF; opening the saved file throws an error.

Can anyone lend a hand in scraping pdf files hosted via a content delivery network?


Solution

  • The issue is that you do not get a PDF from the CDN because it wraps the PDF in a script that automatically sets a password and redirects your request to another URL. To download the PDF, you first have to extract the script tag from the page head to find the URL that points to the PDF. Then you have to build a second request with exactly the same parameters the script is setting:

    1. Signature
    2. Expires
    3. Key-Pair-Id

    The second request then downloads the PDF.

    import os
    import requests
    import urllib.parse as urlparse
    
    from urllib.parse import parse_qs
    from urlextract import URLExtract
    
    from bs4 import BeautifulSoup
    
    url = 'https://embed.widencdn.net/pdf/plus/widnr/rfazsshahb/Fall2017Waterfowl_GreenBay_Survey_Nov.pdf?u=7vkdxn'
    
    filepath = r'C:\Path\{}.pdf'.format('test')
    if os.path.exists(filepath):
        pass
    else:
        request = requests.get(url)
        html = BeautifulSoup(request.content, 'html.parser')
        pdf_script = html.head.find('script', type="text/javascript").string
    
        # Extract the url
        extractor = URLExtract()
        url_to_pdf = extractor.find_urls(pdf_script)
    
        # Parse URL
        parsed = urlparse.urlparse(url_to_pdf[0])
    
        # Get parameters
        signature = parse_qs(parsed.query)['Signature'][0]
        expires = int(parse_qs(parsed.query)['Expires'][0])
        kip = parse_qs(parsed.query)['Key-Pair-Id'][0]
    
        url = parsed.scheme + "://" + parsed.netloc + parsed.path
    
        # Build the second request
        pdf_request = requests.get(url, params={'Key-Pair-Id': kip, 'Signature': signature, 'Expires': expires})
        print(pdf_request)
        with open(filepath, 'wb') as f:
            f.write(pdf_request.content)
    
    

    You may need to install urlextract and BeautifulSoup:

    pip install beautifulsoup4
    pip install urlextract
    

    Note that this is not a general solution and may only work with this CDN.
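    One safeguard that does generalize: every PDF starts with the magic bytes `%PDF-`, so you can check `r.content` (or the saved file) before trusting the download. A minimal sketch (the helper name is my own):

    ```python
    def is_pdf(content: bytes) -> bool:
        """Return True if the bytes look like a PDF (PDF files begin with %PDF-)."""
        return content[:5] == b'%PDF-'

    # Simulated payloads; in practice pass r.content from the requests call.
    print(is_pdf(b'<html><body>CDN wrapper page</body></html>'))  # False
    print(is_pdf(b'%PDF-1.4\n...'))                               # True
    ```

    Checking `r.headers.get('Content-Type') == 'application/pdf'` is a complementary test, though some servers mislabel it, so the magic-bytes check is the more reliable of the two.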