Python: Cannot access list element even though it exists


I'm trying to write code to extract data from a website using Python with the urllib2 and BeautifulSoup libraries.

I tried iterating over the rows of the desired table and storing the "td" cells of each row in a list variable, row_datas. Even though I can print the entire list, I cannot access it at a specific index; the interpreter throws a "list index out of range" error. Here are my code and the output.

import urllib2
from bs4 import BeautifulSoup

link = 'http://www.babycenter.in/a25008319/most-popular-indian-baby-names-of-2013'
page = urllib2.urlopen(link)
soup = BeautifulSoup(page)
right_table = soup.find('table', class_= 'contentTable colborders')
name=[]
meaning=[]
alternate=[]

for row in right_table.find_all("tr"):
  row_datas = row.find_all("td")
  print row_datas
  print row_datas[0]

Output:

[]Traceback (most recent call last):
  File "C:\Users\forcehandler\Documents\python\data_scrape.py", line 41, in <module>

print row_datas[0]
IndexError: list index out of range
[Finished in 1.6s]

To rule out an obvious mistake, I tried similar code on a plain list of lists, and it works as expected. Code:

i = [range(y,10) for y in range(5)]
for j in i:
  print j
  print j[0]

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
0
[1, 2, 3, 4, 5, 6, 7, 8, 9]
1
[2, 3, 4, 5, 6, 7, 8, 9]
2
[3, 4, 5, 6, 7, 8, 9]
3
[4, 5, 6, 7, 8, 9]
4

I'm new to programming and couldn't find help anywhere else. Thanks in advance!

Edit: The '[]' before Traceback might have accidentally slipped into the output while copy-pasting. And thanks for your helpful answers/suggestions.

Solution: I didn't check the integrity of data before putting it to use. As it turns out, the first row consisted of only 'th' values and no 'td' values and hence the error.

Lesson: Always test the data before putting it to any use.

On a side note: This is my first question on StackOverflow and I am overwhelmed with such quick, quality and helpful responses.


Solution

  • Your output shows that at least one of the rows is empty:

    []Traceback (most recent call last):
    ^^
    

    That [] is an empty list, printed by your print row_datas line. Normally I'd expect a newline between it and the Traceback; perhaps you didn't copy your output correctly, or your console uses a fixed-size buffer rather than line buffering, which causes it to mix stdout and stderr.

    The list is empty because the first of those rows contains th header cells rather than td cells:

    >>> rows = soup.select('table.contentTable tr')
    >>> rows[0].find('td') is None
    True
    >>> rows[0].find_all('th')
    [<th width="20%">Name</th>, <th>Meaning</th>, <th>Popular <br/>\nalternate spellings</th>]
    

    There is one other such row, so you'll have to code defensively:

    >>> rows[26]
    <tr><th width="20%">Name</th><th>Meaning</th><th>Popular <br/>\nalternate spellings</th></tr>
    

    An empty list is falsy, so you could just test whether there are any elements with an if statement:

    if row_datas:
        print row_datas[0]
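
To see the guard in isolation, here is a minimal sketch that needs no network access or BeautifulSoup. The nested list rows_of_cells and its placeholder values are hypothetical stand-ins for what row.find_all("td") returns per row; the real result is a ResultSet, which as far as I know is a list subclass and so should behave the same way in a truth test. It uses print() with a single argument so it runs on both Python 2 and 3.

```python
# Hypothetical stand-in for the per-row results of row.find_all("td"):
# header rows yield an empty list, data rows yield their cells.
rows_of_cells = [
    [],                                # header row: only th cells, so no td
    ["Name A", "Meaning A", "Alt A"],  # placeholder data, not from the page
    [],                                # second header row
    ["Name B", "Meaning B", "Alt B"],
]

first_cells = []
for row_datas in rows_of_cells:
    if row_datas:  # False for [], True for any non-empty list
        first_cells.append(row_datas[0])

print(first_cells)
```

Without the if, the first iteration would raise the same IndexError as in the question, because [][0] does not exist.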
    

    Code to extract all the names, meanings and alternative spellings is as easy as:

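    for row in soup.select('table.contentTable tr'):
        cells = row.find_all('td')
        if not cells:
            # header rows contain only th cells; skip them
            continue
        # the first cell wraps the name in a link
        name_link = cells[0].find('a')
        name, link = name_link.get_text(strip=True), name_link.get('href')
        # the remaining two cells hold the meaning and the alternate spellings
        meaning, alt = (cell.get_text(strip=True) for cell in cells[1:])
        print '{}: {} ({})'.format(name, meaning, alt)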
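
If you want to try the row-skipping logic without fetching the page, a self-contained sketch along these lines may help. SAMPLE_HTML, its placeholder values and the helper extract_rows are all hypothetical; only the table's class names and header texts are taken from the question. It assumes bs4 is installed and uses the built-in "html.parser" so no extra parser is needed. Unlike the loop above, it skips any row that does not have exactly three td cells, which is a stricter check than if not cells.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: header rows use <th>,
# data rows use <td>. The values are placeholders, not scraped data.
SAMPLE_HTML = """
<table class="contentTable colborders">
  <tr><th>Name</th><th>Meaning</th><th>Popular alternate spellings</th></tr>
  <tr><td><a href="/name-a">Name A</a></td><td>Meaning A</td><td>Alt A</td></tr>
  <tr><th>Name</th><th>Meaning</th><th>Popular alternate spellings</th></tr>
  <tr><td><a href="/name-b">Name B</a></td><td>Meaning B</td><td></td></tr>
</table>
"""


def extract_rows(html):
    """Return (name, meaning, alternate) tuples from rows with three td cells."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="contentTable")
    results = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 3:  # header rows have no td cells at all
            continue
        results.append(tuple(cell.get_text(strip=True) for cell in cells))
    return results


records = extract_rows(SAMPLE_HTML)
names = [record[0] for record in records]
print(names)
```

Collecting tuples first and printing afterwards keeps the parsing step separate from the output, so the same records can fill the name, meaning and alternate lists from the question.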