
Scraping data from URLs: how to retrieve all the URL pages with missing and unknown final page IDs


I wanted to pull down the data for a set of webpages.

This is an example of the URL:

http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=3&listname=

My situation is:

  1. The 'id=' number in the URL changes between pages.
  2. I want to loop through and retrieve every page in the database.
  3. Some IDs will be missing (e.g. there might be pages with id=3 and id=6, but nothing for id=4 and id=5).
  4. I do not know the final ID (e.g. the last page in the database might be id=100000 or id=1000000000).

I know the general shape of the code I need: build a list of numbers somehow, then loop through those numbers and pull down the text of each page with something like this (parsing the text itself is another day's work):

import urllib2
from bs4 import BeautifulSoup

# id_name is the page ID as a string, e.g. '3'
web_page = "http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id=" + id_name + "&listname="
page = urllib2.urlopen(web_page)
soup = BeautifulSoup(page, 'html.parser')
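
Something like this loop is roughly what I have in mind, assuming that a missing ID returns an HTTP error (if the site returns an empty page instead I would have to check the content) and guessing at an upper bound for the IDs:

import urllib2
from bs4 import BeautifulSoup

base = "http://www.signalpeptide.de/index.php?sess=&m=listspdb_mammalia&s=details&id={}&listname="

soups = {}
for page_id in range(1, 100000):  # upper bound is a guess
    try:
        page = urllib2.urlopen(base.format(page_id))
    except urllib2.HTTPError:
        continue  # skip missing IDs
    soups[page_id] = BeautifulSoup(page, 'html.parser')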

Can anyone advise on the best way to say 'take all of the pages', given that some IDs are missing and I do not know what the last ID is?


Solution

  • In order to get the possible pages, you can do something like this (my example is Python 3):

    import re
    from urllib.request import urlopen
    from lxml import html
    
    ITEMS_PER_PAGE = 50
    
    base_url = 'http://www.signalpeptide.de/index.php'
    url_params = '?sess=&m=listspdb_mammalia&start={}&orderby=id&sortdir=asc'
    
    
    def get_pages(total):
        # 'start' offsets for the listing pages, in steps of ITEMS_PER_PAGE
        pages = list(range(ITEMS_PER_PAGE, total, ITEMS_PER_PAGE))
        if not pages:  # fewer records than a single page
            return pages
        last = pages[-1]
        if last < total:
            pages.append(total)  # include the final offset
        return pages
    
    def generate_links():
        # Fetch one listing page, read the total record count from the
        # pagination text, and build the URL of every listing page.
        start_url = base_url + url_params.format(ITEMS_PER_PAGE)
        page = urlopen(start_url).read()
        dom = html.fromstring(page)
        xpath = '//div[@class="content"]/table[1]//tr[1]/td[3]/text()'
        pagination_text = dom.xpath(xpath)[0]
        total = int(re.findall(r'of\s(\w+)', pagination_text)[0])
        print(f'Number of records to scrape: {total}')
        pages = get_pages(total)
        links = (base_url + url_params.format(i) for i in pages)
        return links
    
    

    Basically, it fetches the first page and obtains the total number of records. Given that every page shows 50 records, get_pages() can calculate the values passed to the start parameter and generate all of the pagination URLs. You then need to fetch each of those pages, iterate over the table of proteins, and follow the link to each details page to obtain the information you require, using BeautifulSoup or lxml with XPath (see the sketch below). I tried fetching all of these pages concurrently with asyncio and the server was timing out :). Hope my functions help!
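
    For example, this is roughly how the generated links could be consumed; the XPath for the per-protein detail links is my assumption and may need adjusting to the actual markup of the listing table:

    from urllib.parse import urljoin
    from urllib.request import urlopen
    from lxml import html
    
    for link in generate_links():
        dom = html.fromstring(urlopen(link).read())
        # hrefs of the detail pages inside the listing table (XPath is a guess)
        hrefs = dom.xpath('//div[@class="content"]/table[1]//tr/td/a/@href')
        for href in hrefs:
            if 's=details' in href:
                detail_page = urlopen(urljoin(base_url, href)).read()
                # ... parse detail_page with lxml or BeautifulSoup here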