I wanted to pull down the data for a set of webpages.
This is an example of the URL:
http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=3&listname=
My question is:
I know the general shape of the code I need: make a list of id numbers somehow, and then loop through those numbers with code like this to pull down the text of each page (parsing the text itself is another day's work):
import urllib2
from bs4 import BeautifulSoup

id_name = '3'  # placeholder: the id of the record whose page I want
web_page = "http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=" + id_name + "&listname="
page = urllib2.urlopen(web_page)
soup = BeautifulSoup(page, 'html.parser')
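For example, if I knew the range of ids in advance, I assume the loop would look something like this (the upper limit of 1000 is only a guess, and I'm not sure whether missing ids raise an error or just return an empty page):

pages = {}
for id_number in range(1, 1001):  # guessed upper limit; I don't actually know the last id
    web_page = ("http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia"
                "&s=details&id=" + str(id_number) + "&listname=")
    try:
        page = urllib2.urlopen(web_page)
    except urllib2.HTTPError:
        continue  # some ids seem to be missing; skip the ones that error out
    pages[id_number] = page.read()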
Can anyone advise on the best way to say 'take all of the pages', to get around the issues I'm facing of missing pages and not knowing when the last page is?
In order to get the possible pages, you can do something like this (my example is Python 3):
import re
from urllib.request import urlopen

from lxml import html

ITEMS_PER_PAGE = 50
base_url = 'http://www.signalpeptide.de/index.php'
url_params = '?sess=&m=listspdb_mammalia&start={}&orderby=id&sortdir=asc'


def get_pages(total):
    """Return the values to pass to the 'start' query parameter for each results page."""
    pages = list(range(ITEMS_PER_PAGE, total, ITEMS_PER_PAGE))
    if not pages or pages[-1] < total:
        pages.append(total)
    return pages


def generate_links():
    # Fetch one results page, read the pagination text to find the total
    # number of records, then build the URL for every page of results.
    start_url = base_url + url_params.format(ITEMS_PER_PAGE)
    page = urlopen(start_url).read()
    dom = html.fromstring(page)
    xpath = '//div[@class="content"]/table[1]//tr[1]/td[3]/text()'
    pagination_text = dom.xpath(xpath)[0]
    total = int(re.findall(r'of\s(\w+)', pagination_text)[0])
    print(f'Number of records to scrape: {total}')
    pages = get_pages(total)
    links = (base_url + url_params.format(i) for i in pages)
    return links
Basically what it does is fetch the first page and obtain the total number of records. Given that every page has 50 records, the get_pages() function can calculate the values passed to the start parameter, and generate_links() builds all the pagination URLs. You then need to fetch all of those pages, iterate over the table row for each protein, and follow the link to its details page to obtain the information you require, using BeautifulSoup or lxml with XPath. I tried getting all these pages concurrently using asyncio and the server was timing out :). Hope my functions help!
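If it helps, here is a rough sketch of that follow-up step. The XPath for the details links is an assumption (check the actual table markup), and fetching one page at a time with a short delay should avoid the timeouts I hit when firing everything off concurrently:

import time
from urllib.parse import urljoin
from urllib.request import urlopen

from lxml import html


def scrape_details(links, delay=1.0):
    """Fetch each listing page, follow its details links, and yield the parsed detail pages."""
    for link in links:
        dom = html.fromstring(urlopen(link).read())
        # Assumed XPath for the per-protein details links in the results table;
        # adjust it to match the real markup of the listing page.
        for href in dom.xpath('//div[@class="content"]//table//tr/td/a/@href'):
            detail_url = urljoin(link, href)
            yield html.fromstring(urlopen(detail_url).read())
            time.sleep(delay)  # be polite: one request at a time instead of hammering the server


# usage:
# for detail_dom in scrape_details(generate_links()):
#     ...extract the protein fields you need with XPath (or BeautifulSoup)...

Yielding the parsed detail pages one at a time keeps memory use low and lets you start parsing records while the scrape is still running.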