python web-scraping pagination trustpilot

Trustpilot Reviews Scraper Pagination Not Working


I've been trying to scrape customer reviews of DoorDash from several Trustpilot pages, but for some reason it only scrapes the first page over and over again (pagination doesn't seem to be working). Here's my code:

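import json
from random import randint
from time import sleep

import bs4
import numpy as np
import requests

review_text=[]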
review_score=[]
review_date=[]
review_title=[]

pages = np.arange(1, 10, 1)
for page in pages:
    page = requests.get("https://www.trustpilot.com/review/doordash.com" + "?page=" + str(page))
    sleep(randint(2,10))
    if response.status_code == 200:
        soup = bs4.BeautifulSoup(response.text)
        for rev in soup.find_all('div',class_="review-content"):
            nv = rev.find_all('p',class_= 'review-content__text')
            review = rev.p.text.strip() if len(nv) == True else '-'
            review_text.append(review)            
            date_json = json.loads(rev.find('script').string)
            date = date_json['publishedDate']
            review_date.append(date)
        for rev in soup.find_all('div',class_='star-rating star-rating--medium'):
            review_score.append(rev.find('img').get('alt'))
        for rev in soup.find_all('h2',class_='review-content__title'):
            review_title.append(rev.text.strip())
    else:
        print("Issue getting url")

Does anyone have any idea as to how I can fix this? (Everything else, aside from pagination, works perfectly) Thanks!


Solution

  • Trustpilot pagination is not done by requesting page 1, page 2, and so on. Instead, get the next-page URL from the current page and scrape its content. The example below shows how to follow the next-page link while scraping each page:

    import requests
    from lxml import html

    base_url = "https://trustpilot.com/review/doordash.com"
    general = "https://trustpilot.com"
    num_pages = 20  # maximum number of pages to visit
    url = base_url
    for _ in range(num_pages):
        # place the function that collects reviews from one page here
        scrape_page(url)
        page = requests.get(url)
        tree = html.fromstring(page.content)
        # look for the "next page" link on the current page
        next_page = tree.xpath("//a[contains(@class, 'next-page')]")
        if not next_page:
            break  # no next link, so this was the last page
        url = general + next_page[0].get('href')
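
The "follow the next link" idea above can also be sketched without any network access, which makes it easier to test. This is only an illustration under stated assumptions: it uses the standard library's html.parser instead of lxml, it assumes the next link is an `<a>` whose class contains `next-page` (taken from the answer, and it may not match Trustpilot's current markup), and `NextLinkFinder`, `find_next_url`, `crawl` and the `fetch` parameter are hypothetical names introduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> whose class contains 'next-page'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attrs = dict(attrs)
        # 'next-page' is the class fragment assumed by the answer above
        if "next-page" in (attrs.get("class") or "") and attrs.get("href"):
            self.next_href = attrs["href"]


def find_next_url(page_html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(page_html)
    if finder.next_href is None:
        return None
    # urljoin resolves relative hrefs against the page they came from
    return urljoin(current_url, finder.next_href)


def crawl(start_url, fetch, max_pages=20):
    """Yield (url, html) for each page, following next-page links.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML as a string, e.g. lambda u: requests.get(u).text.
    """
    url, seen = start_url, set()
    # stop on a missing link, a repeated URL, or the page limit
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_html = fetch(url)
        yield url, page_html
        url = find_next_url(page_html, url)
```

With requests installed, something like `for url, page_html in crawl(base_url, lambda u: requests.get(u).text): ...` would hand each page's HTML to the review-parsing code. Passing `fetch` in as a parameter keeps the link-following logic separate from the HTTP call, and the `seen` set stops the loop if a page ever links back to one already visited.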