Tags: python, screen-scraping

Random "IndexError: list index out of range"


I am trying to scrape a site that returns its data via JavaScript. The code I wrote using BeautifulSoup works pretty well, but at random points during scraping I get the following error:

Traceback (most recent call last):
File "scraper.py", line 48, in <module>
accessible = accessible[0].contents[0]
IndexError: list index out of range

Sometimes I can scrape 4 URLs, sometimes 15, but at some point the script eventually fails with the error above. I can find no pattern behind the failures, so I'm really at a loss here - what am I doing wrong?

from bs4 import BeautifulSoup
import urllib
import urllib2
import jabba_webkit as jw
import csv
import string
import re
import time

countries = csv.reader(open("countries.csv", 'rb'), delimiter=",")
database = csv.writer(open("herdict_database.csv", 'w'), delimiter=',')

basepage = "https://www.herdict.org/explore/"
session_id = "indepth;jsessionid=C1D2073B637EBAE4DE36185564156382"
ccode = "#fc=IN"
end_date = "&fed=12/31/"
start_date = "&fsd=01/01/"

year_range = range(2009, 2011)
years = [str(year) for year in year_range]

def get_number(var):
    number = re.findall(r"(\d+)", var)

    if len(number) > 1:
        thing = number[0] + number[1]
    else:
        thing = number[0]

    return thing

def create_link(basepage, session_id, ccode, end_date, start_date, year):
    link = basepage + session_id + ccode + end_date + year + start_date + year
    return link



for ccode, name in countries:
    for year in years:
        link = create_link(basepage, session_id, ccode, end_date, start_date, year)
        print link
        html = jw.get_page(link)
        soup = BeautifulSoup(html, "lxml")

        accessible = soup.find_all("em", class_="accessible")
        inaccessible = soup.find_all("em", class_="inaccessible")

        accessible = accessible[0].contents[0]
        inaccessible = inaccessible[0].contents[0]

        acc_num = get_number(accessible)
        inacc_num = get_number(inaccessible)

        print acc_num
        print inacc_num
        database.writerow([name]+[year]+[acc_num]+[inacc_num])

        time.sleep(2)

Solution

  • You need to add error handling to your code. When you scrape a lot of websites, some of them will be malformed or otherwise broken, and when that happens you end up trying to index into empty objects.

    Go through the code, find every place where you assume a call succeeded, and guard against failure there.

    For this specific case, I would check both result lists right after the find_all calls, before indexing into them:

    if not inaccessible or not accessible:
        # find_all returned an empty list: malformed page, skip it
        continue
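
    The same defensive pattern in a minimal, self-contained sketch. Plain lists stand in for BeautifulSoup result sets here, and the helper first_text is a hypothetical name of mine, not part of the original script; the hardened get_number also returns None instead of raising when a string contains no digits at all, which is another spot where an IndexError could bite:

    ```python
    import re

    def get_number(var):
        """Extract the digits from a string like '1,234 sites accessible'.

        Returns the digit groups joined together ('1,234' -> '1234'),
        or None when the string contains no digits at all, instead of
        raising IndexError like the original number[0] access would.
        """
        digits = re.findall(r"\d+", var)
        if not digits:
            return None
        return "".join(digits)

    def first_text(results):
        """Return the first item of a result set, or None if it is empty.

        This is the check the scraping loop was missing before doing
        results[0].contents[0].
        """
        if not results:
            return None
        return results[0]

    # Plain lists stand in for soup.find_all(...) result sets.
    pages = [
        (["1,234 sites accessible"], ["56 sites inaccessible"]),  # normal page
        ([], ["99 sites inaccessible"]),                          # malformed page
    ]

    for accessible, inaccessible in pages:
        acc = first_text(accessible)
        inacc = first_text(inaccessible)
        if acc is None or inacc is None:
            # malformed or incomplete page: skip it and move on
            print("skipping malformed page")
            continue
        print(get_number(acc), get_number(inacc))  # -> 1234 56
    ```

    The key point is that every assumption (a non-empty result set, a string that actually contains digits) is checked before it is used, so one bad page no longer kills the whole run.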