I wanted to assemble a set of patent IDs for the search term 'automobile'. I wrote this code:
import urllib2
from bs4 import BeautifulSoup
import sys
import StringIO
import re

search_term = 'automobile'
patent_list = []
for i in range(100): #for the first 100 pages of results
    web_page = 'https://www.lens.org/lens/search?q=' + str(search_term) + '&sat=P&l=en&st=true&p=' + str(i) + '&n=100'
    page = urllib2.urlopen(web_page)
    soup = BeautifulSoup(page, 'html.parser')
    for aref in soup.findAll("a", href=True):
        if re.findall('/lens/patent', aref['href']):
            link = aref['href']
            split_link = link.split('/')
            if len(split_link) == 4:
                patent_list.append(split_link[-1])
print '\n'.join(set(patent_list))
However, I got a 503 error. I googled this and found: 'The server is currently unable to handle the request due to a temporary overloading or maintenance of the server.'
Does this mean (1) I should not use an algorithm and should manually assemble the IDs instead, or (2) I should break the request down into smaller chunks?
If the answer is (2), how would I break this into smaller requests?
Neither.
503 basically means the server is too busy, or sometimes offline.
When you run your script (or browse the website in your browser), you will notice how long the server takes to handle a single request. If it struggles with one request, you can guess that 100 requests in a row is a little too much for your target.
Still, the first 16, 17 or 18 calls work fine. Maybe the server just needs a little more time between queries to handle them?
Just add import time at the top of your file, put time.sleep(10) at the end of your loop, and profit.
You will surely want to add some logs here and there; here is my version of your code (I just added time.sleep() plus some prints):
import urllib2
from bs4 import BeautifulSoup
import sys
import StringIO
import re
import time

search_term = 'automobile'
patent_list = []
for i in range(100): #for the first 100 pages of results
    web_page = 'https://www.lens.org/lens/search?q=' + str(search_term) + '&sat=P&l=en&st=true&p=' + str(i) + '&n=100'
    print('fetching {} ({})'.format(i, web_page))
    page = urllib2.urlopen(web_page)
    print('webpage fetched')
    soup = BeautifulSoup(page, 'html.parser')
    for aref in soup.findAll("a", href=True):
        if re.findall('/lens/patent', aref['href']):
            link = aref['href']
            split_link = link.split('/')
            if len(split_link) == 4:
                patent_list.append(split_link[-1])
    print('sleeping ten seconds')
    time.sleep(10)
print '\n'.join(set(patent_list))
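If you still hit a stray 503 once in a while even with the sleep, you could also wrap the fetch in a small retry helper instead of letting the script crash. This is only a sketch of that idea (fetch_with_retry and its parameters are my own names, not part of urllib2):

import urllib2
import time

def fetch_with_retry(url, retries=3, pause=10):
    # try the URL a few times, sleeping between attempts when the server answers 503
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.code == 503 and attempt < retries - 1:
                print('got 503, sleeping {} seconds before retrying'.format(pause))
                time.sleep(pause)
            else:
                raise

Then call page = fetch_with_retry(web_page) instead of urllib2.urlopen(web_page) in the loop.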
Now the protip: there are no more than 400 items in the database, so you can stop at page 4. Better yet, check in your loop whether you actually got results and break out of the loop if you did not.
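A minimal sketch of that early-exit check, reusing the imports and variables from the code above (new_ids is my own name, just a per-page list of IDs):

for i in range(100):
    web_page = 'https://www.lens.org/lens/search?q=' + str(search_term) + '&sat=P&l=en&st=true&p=' + str(i) + '&n=100'
    page = urllib2.urlopen(web_page)
    soup = BeautifulSoup(page, 'html.parser')
    # collect this page's IDs separately so we can tell when a page comes back empty
    new_ids = []
    for aref in soup.findAll("a", href=True):
        if re.findall('/lens/patent', aref['href']):
            split_link = aref['href'].split('/')
            if len(split_link) == 4:
                new_ids.append(split_link[-1])
    if not new_ids:
        print('no results on page {}, stopping'.format(i))
        break
    patent_list.extend(new_ids)
    time.sleep(10)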