I'm having trouble downloading the contents of PDFs (found here) because they are hosted on a Content Delivery Network (CDN) called Widen.
The code below works on PDF links embedded directly within the webpage...
import os
import requests

url = 'https://embed.widencdn.net/pdf/plus/widnr/kdlgedcepu/miss_surv_120117.pdf?u=7vkdxn'
filepath = r"C:\Users\pathgoeshere\{}.pdf".format('test')

if os.path.exists(filepath):
    pass
else:
    r = requests.get(url)
    with open(filepath, 'wb') as f:
        f.write(r.content)
... but since the URL points to the content delivery network and not to the PDF itself, the request does not return the desired PDF, and an error is thrown when the downloaded file is opened.
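A quick way to confirm this is to check the first bytes of the response body: a genuine PDF always starts with the magic bytes `%PDF`, while an HTML wrapper page from the CDN starts with `<`. A minimal sketch (the helper name `is_pdf` is my own, not from the code above):

```python
def is_pdf(payload: bytes) -> bool:
    # A genuine PDF file begins with the magic bytes b'%PDF';
    # an HTML wrapper page served by the CDN begins with b'<' instead.
    return payload[:4] == b'%PDF'

# Usage (hypothetical): r = requests.get(url); is_pdf(r.content)
```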
Can anyone lend a hand in scraping pdf files hosted via a content delivery network?
The issue is that you do not get a PDF from the CDN because it wraps the PDF in a script that automatically sets signed-URL parameters (Signature, Expires, Key-Pair-Id) and redirects your request to another URL. To download the PDF, you first have to extract the script tag from the page head to find the URL that points to the PDF, and then build a second request with exactly the same parameters the script sets:
The second request then downloads the PDF.
import os
import requests
import urllib.parse as urlparse
from urllib.parse import parse_qs
from urlextract import URLExtract
from bs4 import BeautifulSoup
url = 'https://embed.widencdn.net/pdf/plus/widnr/rfazsshahb/Fall2017Waterfowl_GreenBay_Survey_Nov.pdf?u=7vkdxn'
filepath = r'C:\Path\{}.pdf'.format('test')
if os.path.exists(filepath):
    pass
else:
    request = requests.get(url)
    html = BeautifulSoup(request.content, 'html.parser')
    pdf_script = html.head.find('script', type="text/javascript").string

    # Extract the url
    extractor = URLExtract()
    url_to_pdf = extractor.find_urls(pdf_script)

    # Parse URL
    parsed = urlparse.urlparse(url_to_pdf[0])

    # Get parameters
    signature = parse_qs(parsed.query)['Signature'][0]
    expires = int(parse_qs(parsed.query)['Expires'][0])
    kip = parse_qs(parsed.query)['Key-Pair-Id'][0]
    url = parsed.scheme + "://" + parsed.netloc + parsed.path

    # Build second request
    pdf_request = requests.get(url, params={'Key-Pair-Id': kip, 'Signature': signature, 'Expires': expires})
    print(pdf_request)

    with open(filepath, 'wb') as f:
        f.write(pdf_request.content)
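The parameter-extraction step can be exercised offline on a made-up signed URL; the host, path, and parameter values below are placeholders, not real credentials:

```python
import urllib.parse as urlparse
from urllib.parse import parse_qs

# A hypothetical CloudFront-style signed URL with placeholder values.
signed = ('https://pdf.widencdn.net/some/file.pdf'
          '?Key-Pair-Id=APKEXAMPLE&Signature=abc123&Expires=1700000000')

parsed = urlparse.urlparse(signed)
params = parse_qs(parsed.query)

# Rebuild the bare URL and pull out each signing parameter.
base_url = parsed.scheme + "://" + parsed.netloc + parsed.path
print(base_url)                   # → https://pdf.widencdn.net/some/file.pdf
print(params['Signature'][0])     # → abc123
print(int(params['Expires'][0]))  # → 1700000000
```

Note that `parse_qs` returns a list for every key, which is why each value is indexed with `[0]`.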
You may need to install urlextract and BeautifulSoup:
pip install beautifulsoup4
pip install urlextract
Note that this is not a general solution and may only work with this CDN.