Tags: python, web-scraping, beautifulsoup

How to scrape product details on an Amazon webpage using BeautifulSoup


For the webpage http://www.amazon.com/Harry-Potter-Prisoner-Azkaban-Rowling/dp/0439136369/ref=pd_sim_b_2?ie=UTF8&refRID=1MFBRAECGPMVZC5MJCWG, how can I scrape the product details and output them as a dict in Python? In this case, the dict I want would contain:

Age Range: 9 - 12 years
Grade Level: 4 - 7
...
...

I'm new to BeautifulSoup and couldn't find a good example of how to do this. I'd like an example to follow.


Solution

  • The idea is to iterate over all Product Details items using the CSS selector table#productDetailsTable div.content ul li, then use the bold text of each item as the key and its next sibling as the value:

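    from pprint import pprint

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.amazon.com/dp/0439136369'
    # Send a browser-like User-Agent header with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
    response = requests.get(url, headers=headers)

    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(response.content, 'html.parser')
    tags = {}
    for li in soup.select('table#productDetailsTable div.content ul li'):
        try:
            title = li.b  # the bold label, e.g. "Age Range:"
            key = title.text.strip().rstrip(':')
            value = title.next_sibling.strip()  # the text right after the label

            tags[key] = value
        except AttributeError:
            # li.b is None when the item has no bold label: stop here.
            break

    pprint(tags)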
    

    Prints:

    {
        u'Age Range': u'9 - 12 years',
        u'Amazon Best Sellers Rank': u'#1,440 in Books (',
        u'Average Customer Review': u'',
        u'Grade Level': u'4 - 7',
        u'ISBN-10': u'0439136369',
        u'ISBN-13': u'978-0439136365',
        u'Language': u'English',
        u'Lexile Measure': u'880L',
        u'Mass Market Paperback': u'448 pages',
        u'Product Dimensions': u'1.2 x 5.2 x 7.8 inches',
        u'Publisher': u'Scholastic Paperbacks (September 11, 2001)',
        u'Series': u'Harry Potter (Book 3)',
        u'Shipping Weight': u'11.2 ounces ('
    }
    

    Note that we break out of the loop as soon as we hit an AttributeError. This happens once an li element no longer contains any bold text.
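
    To try the key/value extraction without hitting the live site, here is a minimal sketch that runs the same selector against an inline HTML fragment. The fragment (SAMPLE_HTML) and the helper name parse_product_details are made up for illustration and only imitate the layout described above; the real page markup may differ. This variant also skips items without a bold label instead of breaking, which is a design choice rather than what the answer above does.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the Product Details layout described above;
# it is NOT copied from the real page.
SAMPLE_HTML = """
<table id="productDetailsTable"><tr><td>
<div class="content"><ul>
  <li><b>Age Range:</b> 9 - 12 years</li>
  <li><b>Grade Level:</b> 4 - 7</li>
  <li>No bold label here</li>
  <li><b>Language:</b> English</li>
</ul></div>
</td></tr></table>
"""


def parse_product_details(html):
    """Return a dict mapping each bold label to the text that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for li in soup.select("table#productDetailsTable div.content ul li"):
        label = li.find("b")
        if label is None:
            # No bold label in this item: skip it rather than stop the loop.
            continue
        key = label.get_text(strip=True).rstrip(":")
        sibling = label.next_sibling
        # Only plain text siblings carry a value; anything else becomes "".
        value = sibling.strip() if isinstance(sibling, str) else ""
        details[key] = value
    return details


details = parse_product_details(SAMPLE_HTML)
print(details)
```

    Using continue keeps any labelled items that appear after an unlabelled one, whereas break (as in the answer) stops at the first item without bold text. Which behaviour is preferable depends on how the page is laid out.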