python · pdf · web-scraping · beautifulsoup · filenames

How to scrape PDFs to a local folder with filename = URL and a delay within the iteration?


I scraped a website (url = "http://bla.com/bla/bla/bla/bla.txt") for all the links containing .pdf that were important to me. These are now stored in relative_paths:

['http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-0065.pdf',
 'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-1679.pdf',
 'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/4444/jjjjj-99-9526.pdf',]
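
For context, I collected these links roughly like this (a simplified sketch; the real page, markup and filtering are a bit different):

import requests
from bs4 import BeautifulSoup

url = "http://bla.com/bla/bla/bla/bla.txt"
html = requests.get(url, verify=False).text
soup = BeautifulSoup(html, 'html.parser')

# keep every link that points at a .pdf
relative_paths = [a['href'] for a in soup.find_all('a', href=True)
                  if a['href'].lower().endswith('.pdf')]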

Now I want to store the PDFs "behind" the links in a local folder, with each file named after its URL.

None of the somewhat similar questions on the internet seems to help me towards my goal. The closest I got was when my code generated some weird file that did not even have an extension. Here are some of the more promising code samples I have already tried:

for link in relative_paths:
    content = requests.get(link, verify = False)
    with open(link, 'wb') as pdf:
        pdf.write(content.content)

for link in relative_paths:  
    response = requests.get(url, verify = False)   
    with open(join(r'C:/Users/', basename(url)), 'wb') as f:
        f.write(response.content)

for link in relative_paths:
    filename = link
    with open(filename, 'wb') as f:
        f.write(requests.get(link, verify = False).content)

for link in relative_paths:
    pdf_response = requests.get(link, verify = False)
    filename = link
    with open(filename, 'wb') as f:
        f.write(pdf_response.content)

Now I am confused and don't know how to move forward. Can you fix one of the for loops and provide a small explanation, please? If the URLs are too long to use as filenames, a split at the 3rd-last / is also OK. Thanks :)
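
To show what I mean by splitting at the 3rd-last /, the filename logic could look something like this (just a sketch of the naming, not the download itself; I join the last three segments with '_' because a filename cannot contain /):

link = 'http://aa.bb.ccc.com/dd/ee-fff/gg/hh99/iii/3333/jjjjj-99-0065.pdf'
# keep everything after the 3rd-last '/' and join it into one legal filename
filename = '_'.join(link.split('/')[-3:])
print(filename)  # iii_3333_jjjjj-99-0065.pdf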

Also, the website host asked me not to scrape all of the PDFs at once so that the server does not get overloaded; there are thousands of PDFs behind the links stored in relative_paths. That is why I am looking for a way to incorporate some sort of delay into my requests.


Solution

  • Give this a shot:

    import time
    import requests

    count_downloads = 25  # <-- pause after every 25 downloads
    time_delay = 60       # <-- length of that pause in seconds

    for idx, link in enumerate(relative_paths):
        # pause every `count_downloads` files so the server is not hammered
        if idx > 0 and idx % count_downloads == 0:
            print('Waiting %s seconds...' % time_delay)
            time.sleep(time_delay)

        # take the part of the URL after 'jjjjj-' as the local filename;
        # adjust the split marker to wherever you want to cut your real URLs
        filename = link.split('jjjjj-')[-1]

        try:
            content = requests.get(link).content  # download first, write only on success
            with open(filename, 'wb') as f:
                f.write(content)
            print('Saved: %s' % link)
        except Exception as ex:
            print('%s not saved. %s' % (link, ex))
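
  • If you would rather save everything into one local folder and pause briefly after every single request instead of in batches, a variation could look like this (the folder path, the 2-second pause and the filename split are just placeholders to adapt):

    import os
    import time
    import requests

    target_dir = r'C:/Users/pdfs'            # <-- change to your folder
    os.makedirs(target_dir, exist_ok=True)   # create it if it does not exist

    for link in relative_paths:
        # build the filename from the last three path segments of the URL
        filename = '_'.join(link.split('/')[-3:])
        try:
            content = requests.get(link).content
            with open(os.path.join(target_dir, filename), 'wb') as f:
                f.write(content)
            print('Saved: %s' % link)
        except Exception as ex:
            print('%s not saved. %s' % (link, ex))
        time.sleep(2)  # <-- small pause after every request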