beautifulsoup, urllib2, urllib

dividing urllib2/beautifulsoup requests into smaller request packages


I wanted to assemble a set of patent IDs for the search term 'automobile'. I wrote this code:

import urllib2
from bs4 import BeautifulSoup
import sys
import StringIO
import re


search_term = 'automobile'
patent_list = []
for i in range(100): #for the first 100 pages of results
    web_page = 'https://www.lens.org/lens/search?q=' + str(search_term) + '&sat=P&l=en&st=true&p=' + str(i) + '&n=100'
    page = urllib2.urlopen(web_page)
    soup = BeautifulSoup(page,'html.parser')

    for aref in soup.findAll("a",href=True):
        if re.findall('/lens/patent',aref['href']):
            link = aref['href']
            split_link = link.split('/')
            if len(split_link) == 4:
                patent_list.append(split_link[-1])

print '\n'.join(set(patent_list))

However, I got a 503 error. I googled it and found this explanation:

'The server is currently unable to handle the request due to a temporary overloading or maintenance of the server.'

Does this mean

  1. Do not use an algorithm and assemble the IDs manually instead, or
  2. Break the request down into smaller chunks?

If the answer is (2), how would I break this into smaller requests?


Solution

  • Does this mean (1) do not use an algorithm and assemble the IDs manually instead, or (2) break the request down into smaller chunks?

    Neither.

    1. I don't understand what algorithm you are talking about, but no.
    2. I'm not sure what you mean by "smaller chunks" either, but again no.

    A 503 basically means the server is too busy or temporarily offline.

    When you run your script (or browse the website in your browser) you will notice how long the server takes to handle even a single request. If it struggles with one request, 100 requests in a row is simply too much for your target.

    Still, the first 16, 17 or 18 calls work fine. Maybe the server just needs a little more time between queries to keep up?

    Just add import time at the top of your file and time.sleep(10) at the end of your loop, and profit.

    You will surely want to add some logging here and there; here is my version of your code (I just added time.sleep() plus some prints):

    import urllib2
    from bs4 import BeautifulSoup
    import sys
    import StringIO
    import re
    import time
    
    
    search_term = 'automobile'
    patent_list = []
    for i in range(100): #for the first 100 pages of results
        web_page = 'https://www.lens.org/lens/search?q=' + str(search_term) + '&sat=P&l=en&st=true&p=' + str(i) + '&n=100'
        print('fetching {} ({})'.format(i, web_page))
        page = urllib2.urlopen(web_page)
        print('webpage fetched')
        soup = BeautifulSoup(page,'html.parser')
    
        for aref in soup.findAll("a",href=True):
            if re.findall('/lens/patent',aref['href']):
                link = aref['href']
                split_link = link.split('/')
                if len(split_link) == 4:
                    patent_list.append(split_link[-1])
    
        print('sleeping ten seconds')
        time.sleep(10)
    print '\n'.join(set(patent_list))
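
    If the 503 still comes back even with the pause, another option is to catch it explicitly and retry. This is just a sketch, not part of the fix above: fetch_with_retry is a hypothetical helper and the retries/delay values are arbitrary. It relies on the urllib2 and time imports already at the top of the script, and on urllib2.HTTPError exposing the HTTP status code as .code:

    def fetch_with_retry(url, retries=3, delay=30):
        # retry a few times on a 503 before giving up
        for attempt in range(retries):
            try:
                return urllib2.urlopen(url)
            except urllib2.HTTPError as e:
                if e.code == 503 and attempt < retries - 1:
                    print('got 503, retrying in {} seconds'.format(delay))
                    time.sleep(delay)
                else:
                    raise

    You would then call page = fetch_with_retry(web_page) instead of page = urllib2.urlopen(web_page) inside the loop.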
    

    Now the pro tip: there are no more than 400 items in the database, so you can stop at page 4. Better yet, check in your loop whether the page returned any results and, if not, break out of the loop.
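
    A minimal sketch of that check, assuming a page past the last result simply contains no '/lens/patent' links (found_on_page is a name introduced here; the rest mirrors the loop above):

    for i in range(100):
        # ... build web_page, fetch it and create soup exactly as before ...
        found_on_page = []
        for aref in soup.findAll("a", href=True):
            if re.findall('/lens/patent', aref['href']):
                link = aref['href']
                split_link = link.split('/')
                if len(split_link) == 4:
                    found_on_page.append(split_link[-1])
        if not found_on_page:
            # an empty page means we are past the last result, stop requesting more
            break
        patent_list.extend(found_on_page)
        print('sleeping ten seconds')
        time.sleep(10)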