Tags: python, pdf, get, python-requests

How to prevent downloading an empty pdf file while using get and requests in Python?


I am scraping a website, accessible from this link, using Beautiful Soup. The idea is to download the file behind every href that contains the string .pdf using requests.get, as sketched below.
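
A minimal sketch of that link-collection step might look like the following (the page URL here is a placeholder, since the scraped page itself is not shown):

from bs4 import BeautifulSoup
import requests

page_url = 'https://bradscholars.brad.ac.uk/'  # placeholder for the scraped page

soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')

# Keep every link whose href contains the string '.pdf'
pdf_links = [a['href'] for a in soup.find_all('a', href=True) if '.pdf' in a['href']]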

The code below demonstrates the download step for a single file and works as intended:

import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
with open(filename, 'wb') as f:
    f.write(requests.get(url_to_download_pdf).content)

However, there are instances where a URL such as the one above (i.e., the variable url_to_download_pdf) leads to a "Page not found" page. As a result, an unusable and unreadable PDF is downloaded.

Opening the file with a PDF reader in Windows gives the following warning:

[Screenshot: PDF reader error warning that the file cannot be opened]

Is there any way to avoid accessing and downloading an invalid PDF file?


Solution

  • Thanks to @Nicolas for the suggestion: save the file as a PDF only if the response returns status 200:

    if response.status_code == 200:
    

    In the previous version, an empty file was created regardless of the response, because with open(filename, 'wb') as f: ran before the status_code check.

    To mitigate this, with open(filename, 'wb') as f: should run only when the status check passes.

    The complete code is then as follows:

    import requests

    filename = 'new_name.pdf'
    url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
    my_req = requests.get(url_to_download_pdf)
    # Create and write the file only when the request succeeded,
    # so a failed request no longer leaves an empty PDF behind.
    if my_req.status_code == 200:
        with open(filename, 'wb') as f:
            f.write(my_req.content)
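
    A 200 status alone does not guarantee that the body is a real PDF; some servers answer a missing page with an HTML error page and status 200. As an extra safeguard (not part of the suggestion above), the Content-Type header can also be inspected. A minimal sketch:

    my_req = requests.get(url_to_download_pdf)
    # A genuine PDF is normally served with Content-Type 'application/pdf'
    content_type = my_req.headers.get('Content-Type', '')
    if my_req.status_code == 200 and 'pdf' in content_type.lower():
        with open(filename, 'wb') as f:
            f.write(my_req.content)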