python, python-2.7, web-scraping, beautifulsoup, pyqt4

BeautifulSoup returns an empty list


I'm using Python 2.7 in Jupyter. I was trying to retrieve data from this website, and scraping the description and price with BeautifulSoup and the lxml parser worked fine.

site = 'https://www.bedbathandbeyond.com/store/product/dyson-v7-motorhead-cord-free-stick-vacuum-in-fuchsia-steel/1061083288?brandId=162'

However, when I tried to scrape the review comments or the reviewer's location, I got nothing back, only an empty list [].

I also tried rendering the page with PyQt4 first, but it still didn't work. How can I fix this?

My code is attached below:

import sys

from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
from bs4 import BeautifulSoup

site = 'https://www.bedbathandbeyond.com/store/product/dyson-v7-motorhead-cord-free-stick-vacuum-in-fuchsia-steel/1061083288?brandId=162'

class Render(QWebPage):
    """Load a URL in a QWebPage and block until loading finishes."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until _loadFinished calls quit()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

# Render the page, then hand the resulting HTML to BeautifulSoup
r = Render(site)
result = r.frame.toHtml()
formatted_result = str(result.toAscii())
soup = BeautifulSoup(formatted_result, 'lxml')
soup.find_all('span', class_='BVRRValue BVRRUserLocation')  # returns []

Many thanks!


Solution

  • I quickly checked the referenced URL: the reviews are loaded only via an asynchronous call made after you click the "Ratings & Review" tab. If you just load the page without that extra navigation, the reviews are not present in the DOM, and therefore not in the HTML you are parsing with BeautifulSoup. Rendering the page with PyQt4 does not change this, because the click never happens.

    One solution is to trigger the click on the "Ratings & Review" tab before you fetch the HTML and pass it to BeautifulSoup.

    Alternatively, you can make that same asynchronous call yourself. The first page of reviews is retrieved with a GET request to this URL: https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/1061083288/reviews.djs?format=embeddedhtml&page=1&scrollToTop=true

    You can construct this URL for any product on bedbathandbeyond, since all you need is the product id (1061083288 in this case). The id can be read from the original DOM, for instance from the div with id prodRatings, which contains a hidden input field holding the product id. Substitute that id into the URL above to fetch the reviews for any product.
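
The second approach could look roughly like the sketch below. It is written for Python 3 using only the standard library, not the asker's Python 2.7 setup. The helper names (ProductIdFinder, extract_product_id, build_reviews_url, fetch_reviews) and the sample markup are hypothetical, and treating page as a pagination parameter is an assumption based on the page=1 in the URL above. The endpoint itself may have changed since the answer was written.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# URL pattern copied from the answer; using `page` to request later pages
# is an assumption.
REVIEWS_URL = (
    "https://bedbathandbeyond.ugc.bazaarvoice.com/2009-en_us/"
    "{product_id}/reviews.djs?format=embeddedhtml&page={page}&scrollToTop=true"
)


class ProductIdFinder(HTMLParser):
    """Record the value of the first hidden <input> inside div#prodRatings."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # > 0 while the parser is inside div#prodRatings
        self.product_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the target div, or track a nested div inside it.
            if self.div_depth or attrs.get("id") == "prodRatings":
                self.div_depth += 1
        elif (tag == "input" and self.div_depth and self.product_id is None
              and attrs.get("type") == "hidden"):
            self.product_id = attrs.get("value")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def extract_product_id(page_html):
    """Return the product id found in the page HTML, or None if absent."""
    finder = ProductIdFinder()
    finder.feed(page_html)
    return finder.product_id


def build_reviews_url(product_id, page=1):
    """Substitute the product id (and page number) into the reviews URL."""
    return REVIEWS_URL.format(product_id=product_id, page=page)


def fetch_reviews(product_id, page=1):
    """GET one page of reviews; the caller still has to parse the body."""
    with urlopen(build_reviews_url(product_id, page)) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical markup, for illustration only (not copied from the site).
sample = '<div id="prodRatings"><input type="hidden" name="pid" value="1061083288"></div>'
product_id = extract_product_id(sample)
print(build_reviews_url(product_id))
```

With BeautifulSoup already in use, the same lookup can be written as soup.find('div', id='prodRatings').find('input', type='hidden')['value'], assuming the hidden input is the first one inside that div.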