Tags: python, python-2.7, find, beautifulsoup, attributeerror

Python: find in BeautifulSoup works correctly on one Windows machine but not on another


I have this simple code in Python 2.7 running on Windows 7 machines:

from urllib2 import urlopen
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser

def main():
    html_parser = HTMLParser()

    soup = BeautifulSoup(urlopen("http://www.amazon.com/gp/bestsellers/").read())

    categories = []

    for category_li in soup.find(attrs={'id':'zg_browseRoot'}).find('ul').findAll('li'):
        category = {}
        category['name'] = html_parser.unescape(category_li.a.string)
        category['url'] = category_li.a['href']

        categories.append(category) 

When I run it on one machine it works fine, but when I run it on another machine I get this error message:

Traceback (most recent call last):
  File ".../tmp.py", line 10, in <module>
    for category_li in soup.find(attrs={'id':'zg_browseRoot'}).find('ul').findAll('li'):
AttributeError: 'NoneType' object has no attribute 'find'

Can anyone help me to find out why? Both machines have Python 2.7 installed. I really appreciate any help.


Solution

  • The two machines produced different results because bs4 was using a different parser on each to parse the HTML. On the machine where the code failed, lxml was installed, so bs4 was using that; on the machine where it worked, you were using html.parser. We found this out using the diagnose code.

    Running the diagnose code shows the available parsers on the system and how they parse the html:

    from urllib2 import urlopen
    from bs4.diagnose import diagnose
    data = urlopen("http://www.amazon.com/gp/bestsellers/").read()
    diagnose(data)
    

    So on the system that had lxml installed, changing the code to name the parser explicitly:

    soup = BeautifulSoup(urlopen("http://www.amazon.com/gp/bestsellers/").read(), "html.parser")

    did the trick.

    Interestingly, I could run the code with either parser on Ubuntu using the same version of bs4 (4.3.2); the only difference was that my lxml version was slightly older (3.4.1.0 vs 3.4.4.0).
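
If you just want a quick check of which parsers a machine has, without the full diagnose output, something like the sketch below may help. It is not part of bs4: the function name installed_parsers and the THIRD_PARTY_PARSERS list are made up for illustration. It assumes the ranking given in the Beautiful Soup documentation (lxml first, then html5lib, then the built-in html.parser) and only tests whether each module can be imported.

```python
import importlib

# Third-party parsers in the order the Beautiful Soup docs rank them.
# Each pair is (name passed to BeautifulSoup, module that must be importable).
# Hypothetical helper data, not something bs4 exposes.
THIRD_PARTY_PARSERS = [("lxml", "lxml"), ("html5lib", "html5lib")]


def installed_parsers():
    """Return the parser names bs4 could use on this machine, best first."""
    found = []
    for parser_name, module_name in THIRD_PARTY_PARSERS:
        try:
            importlib.import_module(module_name)
        except ImportError:
            continue  # not installed on this machine
        found.append(parser_name)
    found.append("html.parser")  # ships with Python, so always available
    return found


print(installed_parsers())
```

If the first entry differs between two machines, a BeautifulSoup call with no parser argument will most likely not behave the same on both, which is why naming the parser explicitly is the safer choice.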