python selenium-webdriver beautifulsoup

Why does my code only scrape the first page of product reviews?


I'm scraping product reviews from this website: https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search=1

I managed to get the reviews, but only the first page of them:

import pandas as pd
from urllib.request import Request, urlopen as uReq #package web scraping
from bs4 import BeautifulSoup as soup

def make_soup(website):
    # Send a browser-like User-Agent header with the request
    req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
    uClient = uReq(req)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, 'html.parser')
    return page_soup

lazada_url = 'https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search=1'

website = make_soup(lazada_url)
news_headlines = pd.DataFrame( columns = ['reviews','sentiment','score'])
headlines = website.findAll('div',attrs={"class":"item-content"})
n = 0
for item in headlines :
    top = item.div
    #print(top)
    #print()
    text_headlines = top.text
    print(text_headlines)
    print()
    n +=1
    # Store the text in the existing 'reviews' column ('title' is not one of the declared columns)
    news_headlines.loc[n-1,'reviews'] = text_headlines

The result only shows the first page. How can I do this for all pages? There is no page number in the URL for me to loop through.

Here is the example output that I need to get all pages of results for:

I like this phone very much and it's global version. I recommend this phone for who like gaming. Delivery just took 3 days only. Thanks Lazada

Item was received in just two days and was wonderfully wrapped. Thanks for the excellent services Lazada!

Very happy with the phone. It's original, it arrived in good condition. Built quality is superb for a budget phone.

The delivery is very fast just take one day to reach at my home. However, the tax invoice is not attached. How do I get the tax invoice?

great deal from lazada. anyway, i do not find any tax invoice. please do email me the tax invoice. thank you.

Solution

  • You can scrape the pagination buttons at the bottom of the reviews to find the minimum and maximum page numbers, then request each page in turn:

    import requests
    from bs4 import BeautifulSoup as soup
    
    def get_page_reviews(content:soup) -> dict:
      rs = content.find('div', {'class':'mod-reviews'}).find_all('div', {'class':'item'})
      reviews = [i.find('div', {'class':'item-content'}).find('div', {'class':'content'}).text for i in rs]
      stars = [len(c.find('div', {'class':'top'}).find_all('img')) for c in rs]
      _by = [i.find('div', {'class':'middle'}).find('span').text for i in rs]
      return {'stars':stars, 'reviews':reviews, 'authors':_by}
    
    d = soup(requests.get('https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search=1').text, 'html.parser')
    # the pagination buttons hold the page numbers; skip buttons without a numeric label
    results = [int(i.text) for i in d.find_all('button', {'class':'next-pagination-item'}) if i.text.strip().isdecimal()]
    final_result = []
    for i in range(min(results), max(results)+1):
      new_url = f'https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search={i}'
      #now, can use new_url to request the next page of reviews
      r = get_page_reviews(soup(requests.get(new_url).text, 'html.parser'))
      #extend rather than reassign, so reviews from earlier pages are kept
      final_result.extend({'stars':a, 'author':b, 'review':c} for a, b, c in zip(r['stars'], r['authors'], r['reviews']))
    

    Output (for first page):

    [{'stars': 5, 'author': 'by Ridwan R.', 'review': "I like this phone very much and it's global version. I recommend this phone for who like gaming. Delivery just took 3 days only. Thanks Lazada"}, {'stars': 5, 'author': 'by Razli A.', 'review': 'Item was received in just two days and was wonderfully wrapped. Thanks for the excellent services Lazada!'}, {'stars': 5, 'author': 'by Nur F.', 'review': "Very happy with the phone. It's original, it arrived in good condition. Built quality is superb for a budget phone."}, {'stars': 5, 'author': 'by Muhammad S.', 'review': 'The delivery is very fast just take one day to reach at my home. However, the tax invoice is not attached. How do I get the tax invoice?'}, {'stars': 5, 'author': 'by Xavier Y.', 'review': 'great deal from lazada. anyway, i do not find any tax invoice. please do email me the tax invoice. thank you.'}]
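
The two non-network steps of this approach, turning the pagination button labels into a page range and accumulating the per-page results, can be sketched on their own. This is only a minimal illustration: the helper names `page_range` and `flatten_page` and the sample data are hypothetical, and it assumes the button texts and the dict returned by `get_page_reviews` look like the ones shown above.

```python
def page_range(labels):
    """Return the inclusive range of page numbers found in button labels.

    Non-numeric labels (for example '...' or empty strings) are ignored.
    """
    stripped = (label.strip() for label in labels)
    pages = [int(s) for s in stripped if s.isdecimal()]
    if not pages:
        return range(0)  # no pagination buttons found
    return range(min(pages), max(pages) + 1)


def flatten_page(page):
    """Convert a dict of parallel lists into a list of per-review dicts."""
    return [
        {'stars': s, 'author': a, 'review': r}
        for s, a, r in zip(page['stars'], page['authors'], page['reviews'])
    ]


# Hypothetical sample data standing in for scraped button labels and pages
labels = ['1', '2', '3', '...', '', '12']
pages = [
    {'stars': [5, 4], 'authors': ['by A.', 'by B.'], 'reviews': ['good', 'ok']},
    {'stars': [3], 'authors': ['by C.'], 'reviews': ['slow delivery']},
]

all_reviews = []
for page in pages:
    all_reviews.extend(flatten_page(page))  # accumulate, do not overwrite

print(list(page_range(labels))[:3], len(all_reviews))
```

Returning an empty range when no numeric labels are found means the page loop simply does not run, rather than failing on `min()` of an empty list. The flat list of dicts can be passed directly to `pd.DataFrame(all_reviews)` if a table is needed.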