Search code examples
pythonscreen-scraping

How to recursively scrape a web page to check if there are new pdf files in python?


Theres a website which puts up a pdf report every month. I want to monitor it every hour and get the new pdf emailed to my email I'd whenever the new pdf gets uploaded. I want to use python for it. Also I'm familiar with beautiful soup and scrapy but I dont know how to check for new pdf files and only grab the new pdf file.


Solution

  • import requests
    from time import strftime, sleep
    
    # Here I Will Get The Current Month And Year And Assign It To Variables.
    
    month = strftime("%B").lower()
    year = strftime("%Y")
    
    # I've noticed that the website publishing the file by the Month name in lower-case and year.
    # Now we will loop on the url each one hour and download the file once released and exit.
    # You Have to set cron-job to run every month.
    
    
    while True:
        r = requests.get(
            f"http://www.pbs.gov.pk/sites/default/files//price_statistics/monthly_price_indices/nb/2019/cpi_review_nb_{month}_{year}.pdf")
        if r.status_code == 200:
            with open(f"{month}.pdf", 'wb') as f:
                f.write(r.content)
                print(f"File Already Saved As {month}.pdf")
                break
        else:
            print("File Not Raised Yet, We Will Check Back After One Hour.")
            sleep(3600)
            continue
    
    # you can also put the current href links in a file.
    # then loop over source and check if new href is released so download and append the href to file.
    # but if the href in file so keep checking each hour.