Tags: python, web-scraping, xpath, lxml

XPath HTML scraping doesn't return the text / number of the useful score


I am scraping the usefulness scores of reviews using XPath and lxml.

#%% Step 1: Import all of the extensions and packages.
from lxml import html
from urllib import request
import requests
from datetime import datetime
import csv
import re
import glob
import pandas as pd

reviewcontent = []  # review texts collected across all files
usefulness = []     # "Useful" label texts collected across all files

#%%
path = pathx  # pathx is defined elsewhere (not shown)
for files in glob.glob(path + "*.htm*"):
    with open(files, "r", encoding="utf-8", errors="ignore") as f:
        page = f.read()
        tree = html.fromstring(page)
        # Review bodies: flatten line breaks and strip leading whitespace.
        reviews = tree.xpath('//*[@class="styles_reviewContent__0Q2Tg"]')
        reviews = [r.text_content() for r in reviews]
        reviews = [r.replace('\n', ' ') for r in reviews]
        reviews = [r.replace('\r', ' ') for r in reviews]
        reviews = [r.lstrip() for r in reviews]
        reviewcontent += reviews
        # "Useful" labels: this is the part that no longer returns anything.
        useful = tree.xpath('//*[@class="typography_body-m__xgxZ_ typography_appearance-inherit__D7XqR styles_usefulLabel__qz3JV"]')
        useful = [u.text_content() for u in useful]
        useful = [u.lstrip() for u in useful]
        usefulness += useful

While I can extract the review content without problems, the code fails to extract the usefulness score. It used to work and produced this output:

'Useful'
'Useful1' 
'Useful'
'Useful2' 

i.e. the second review received 1 vote and the fourth received 2. However, either I changed something or something else changed, and I no longer get any output.

Example link: https://www.trustpilot.com/review/trivago.com

My goal is thus to scrape, for every review, the number of votes it received, including 0.
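
To make that goal concrete, here is a minimal sketch of how the label strings shown earlier ('Useful', 'Useful1', ...) could be turned into vote counts. The helper name `useful_votes` is hypothetical, and the sketch assumes the count, when present, is simply the run of digits inside the label:

```python
import re


def useful_votes(label):
    """Return the vote count embedded in a label such as 'Useful2'.

    Assumption: a label without any digits (plain 'Useful') means 0 votes.
    """
    match = re.search(r"\d+", label)
    return int(match.group()) if match else 0


# Label strings copied from the output shown in the question.
labels = ["Useful", "Useful1", "Useful", "Useful2"]
votes = [useful_votes(label) for label in labels]
print(votes)  # [0, 1, 0, 2]
```

Defaulting to 0 keeps one entry per review, so the votes stay aligned with the list of review texts.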

I tried different configurations and Stack Overflow topics, and also looked at the span markup, but nothing helped.

Thank you!


Solution

  • To get the score, title, and text of each review across several pages, you can use the following example:

    import json
    
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.trustpilot.com/review/trivago.com?page="
    
    for page in range(1, 4):  # <-- adjust number of pages here
        soup = BeautifulSoup(requests.get(url + str(page)).content, "html.parser")
        data = soup.select_one("#__NEXT_DATA__")
        data = json.loads(data.text)
    
        for review in data["props"]["pageProps"]["reviews"]:
            print(f"{review['rating']}/5", review["title"])
            print(review["text"])
            print("-" * 80)
    

    Prints:

    
    ...
    
    --------------------------------------------------------------------------------
    1/5 ziro trust to this website
    ziro trust to this website, I did sign in and creat an account, find my hotel, made reservation step by step, gave all information and at least credit card number and reserved, suddenly the page disappear. no E-mail, attention, such a good website.
    --------------------------------------------------------------------------------
    5/5 Trivago is the best option on the market
    Not clear why so many negative feedbacks.
    Trivago is the way to go to choose for your hotel.
    There is no better place for you to compare the prices of so many booking sites. The UI can be improved but honestly it's great. Honest review
    --------------------------------------------------------------------------------
    5/5 We booked through An online booking…
    We booked through An online booking agency Via trivago, the other agency, aroma or something, went out of business some six months before our holiday but the first we knew was when we went to book into our hotel. They told us that the booking had been cancelled months earlier because the company had not sent the money we paid to them and just told them that they were going out of business, we had to pay for our holiday again at a higher price... our Travel insurance Company told us “not our problem” so we were stuck. After our holiday we contacted Trivago and we got a refund of what we paid aroma, it took some time because of this corana thing but we understood that and it was great that they honoured the booking faulted bu the other company. Also the communication between us and trivago was sensational, They answered our concerns within 12 hours or so, which is great since they are on the other side of the world..well done Trivago and thank you...😀😀😀
    --------------------------------------------------------------------------------
    
    ...
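
If you would rather stay with lxml and the saved HTML files from the question, one possible cause of the empty result (an assumption, not something the question confirms) is that the exact class string no longer matches, because `@class="..."` must equal the whole attribute value and the hashed suffixes can change. A minimal sketch that matches only the stable-looking part of the class name with `contains()` is shown below. The sample markup and the function name `useful_labels` are hypothetical, and the real page structure may differ:

```python
from lxml import html

# Hypothetical, simplified markup; the hashed class suffixes are made up.
SAMPLE_PAGE = """
<html><body>
  <article>
    <p class="styles_reviewContent__0Q2Tg">Great service.</p>
    <span class="typography_body-m__abc12 styles_usefulLabel__zz9YY">Useful</span>
  </article>
  <article>
    <p class="styles_reviewContent__0Q2Tg">Booking got cancelled.</p>
    <span class="typography_body-m__abc12 styles_usefulLabel__zz9YY">Useful<span>2</span></span>
  </article>
</body></html>
"""


def useful_labels(page_source):
    """Return the text of every element whose class mentions 'styles_usefulLabel'."""
    tree = html.fromstring(page_source)
    # contains() does a substring match, so a changed hash suffix or
    # additional classes on the element do not break the selection.
    nodes = tree.xpath('//*[contains(@class, "styles_usefulLabel")]')
    return [node.text_content().strip() for node in nodes]


labels = useful_labels(SAMPLE_PAGE)
print(labels)  # ['Useful', 'Useful2']
```

The same XPath expression can replace the long exact-match expression inside the file loop from the question. If it still returns nothing, the label is probably not present in the saved HTML at all.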