How to scrape more than one page of critic reviews from Rotten Tomatoes?


I've been using the scraper below to collect critic reviews from this URL: https://www.rottentomatoes.com/m/avengers_endgame/reviews. However, it currently scrapes only the first page of reviews, and I'm struggling to work out how to go through the additional pages. Does anyone know how I would go about this?

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://www.rottentomatoes.com/m/avengers_endgame/reviews")
review_1df = pd.DataFrame(columns=['Date', 'Reviewer', 'Website', 'Review', 'Score'])
dates = []
reviews = []
scores = []
newscores = []
names = []
sites = []
# One "review_area" element per critic review on the current page.
results = driver.find_elements(By.CLASS_NAME, "review_area")

# reviewnum is the 1-based row index used in the absolute XPaths below.
for reviewnum, r in enumerate(results, start=1):
    dates.append(r.find_element(By.CLASS_NAME, 'subtle').text)
    reviews.append(r.find_element(By.CLASS_NAME, 'the_review').text)
    revs = r.find_element(By.CLASS_NAME, 'review_desc')
    scores.append(revs.find_element(By.CLASS_NAME, 'subtle').text)
    # These XPaths start with "//", so they search the whole page rather than
    # inside r; look each row up once per review, directly from the driver.
    row = '//*[@id="reviews"]/div[2]/div[4]/div[' + str(reviewnum) + ']/div[1]/div[3]'
    names.append(driver.find_element(By.XPATH, row + '/a[1]').text)
    sites.append(driver.find_element(By.XPATH, row + '/a[2]/em').text)

for score in scores:
    if score == 'Full Review':
        newscores.append('no score')
    else:
        # Keep everything after the first 14 characters of the score text.
        newscores.append(score[14:])

review_1df['Date'] = dates
review_1df['Review'] = reviews
review_1df['Score'] = newscores
review_1df['Reviewer'] = names
review_1df['Website'] = sites

Solution

  • You can use URL parameters to reach the next page of reviews and repeat the same steps there. For example, the following URL takes you to the second page of reviews:

    https://www.rottentomatoes.com/m/avengers_endgame/reviews?type=&sort=&page=2
    

    Note that the query string is type=&sort=&page=2, so you can also specify the type and the sorting there. Change page=2 to page=3 to get to the third page.

    You'll also need to add a check for whether the page actually exists. For example, you'll get no reviews from this URL:

    https://www.rottentomatoes.com/m/avengers_endgame/reviews?type=&sort=&page=200000
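
The approach above could be sketched roughly as follows: build the URL for each page number, scrape it, and stop at the first page that yields no reviews. This is only a minimal sketch. The helper names (page_url, scrape_all_pages, scrape_review_rows) and the max_pages safety cap are made up for illustration, and the class names are simply the ones used in the question, so they may need adjusting to the live page.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.rottentomatoes.com/m/avengers_endgame/reviews"


def page_url(page, base_url=BASE_URL):
    """Build the reviews URL for one page, e.g. ...?type=&sort=&page=2."""
    return base_url + "?" + urlencode({"type": "", "sort": "", "page": page})


def scrape_all_pages(scrape_page, max_pages=1000):
    """Call scrape_page(url) for pages 1, 2, 3, ... and collect the rows.

    Stops at the first page that returns no rows, which is the
    "does this page exist?" check. max_pages is an assumed safety cap.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = scrape_page(page_url(page))
        if not rows:
            break
        all_rows.extend(rows)
    return all_rows


def scrape_review_rows(driver, url):
    """Load one page with Selenium and return one dict per review.

    The text is extracted before moving on, because elements from a
    previous page become stale once the driver loads a new URL.
    """
    # Imported here so the helpers above can be used without Selenium.
    from selenium.webdriver.common.by import By

    driver.get(url)
    rows = []
    for area in driver.find_elements(By.CLASS_NAME, "review_area"):
        rows.append({
            "Date": area.find_element(By.CLASS_NAME, "subtle").text,
            "Review": area.find_element(By.CLASS_NAME, "the_review").text,
        })
    return rows
```

With a driver already created as in the question, something like rows = scrape_all_pages(lambda url: scrape_review_rows(driver, url)) followed by pd.DataFrame(rows) should give one DataFrame covering every page. The remaining columns (score, reviewer, website) can be added to each dict in the same way.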