node.js web-scraping request cheerio rotten-tomatoes

No data returned from scraping rottentomatoes


I can't figure out how to scrape the following data from https://www.rottentomatoes.com/browse/in-theaters/

Movie Title
Review Score
Release date
Link to movie details
Link to movie poster

I'm not getting any data back, and the code never enters my each loop.

Rotten Tomatoes screenshot

My code:

var cheerio = require("cheerio");
var request = require("request");

// Make a request call to grab the HTML body from the site of your choice
request("https://www.rottentomatoes.com/browse/in-theaters/", function(error, response, html) {

  // Load the HTML into cheerio and save it to a variable
  // '$' becomes a shorthand for cheerio's selector commands, much like jQuery's '$'
  var $ = cheerio.load(html);

  // An empty array to save the data that we'll scrape
  var results = [];

  // Select each element in the HTML body from which you want information.
  // NOTE: Cheerio selectors function similarly to jQuery's selectors,
  // but be sure to visit the package's npm page to see how it works
  $('mb-movie').each(function(i, element) {
    console.log("inside each");
    console.log($(element));
    var link = $(element).children().attr("href");
    var title = $(element).find('h3').text();

    // Save these results in an object that we'll push into the results array we defined earlier
    results.push({
      title: title,
      link: link
    });
  });

  // Log the results once you've looped through each of the elements found with cheerio
  console.log(results);
});

Solution

  • First, your selector is incorrect: it is missing the dot prefix that marks a class name. It should be $(".mb-movie"); without the dot, $("mb-movie") looks for <mb-movie> elements.

    But even with the corrected selector, nothing will match, because the movies are rendered dynamically by JavaScript after the page loads. You can verify this by doing a "view source" on the page in a browser and searching for mb-movie: you won't find it. The mb-movie elements are added by media-browser.js, which runs as part of the page. request is not a web browser; it only downloads the raw HTML and does not execute scripts.

    RT used to have an API, but it no longer seems to be available. There is one from Fandango, but I doubt you'll get what you need from it.

    Another option is to use a browser automation tool such as Selenium or PhantomJS. These are programmable and load the page in a real or headless browser, so the page's scripts run. See this part of the PhantomJS docs for more info: http://phantomjs.org/page-automation.html#dom-manipulation

    We don't know what you're doing this for, but note that RT's terms of use explicitly forbid retrieval for the purpose of creating a database of some sort. Thanks @DaveNewton
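
The "view source" check described above can also be done in code before handing the HTML to cheerio. The following is only a rough sketch: rawHtmlHasClass is a hypothetical helper, and the two HTML strings are made-up stand-ins (not the real Rotten Tomatoes markup) for what request downloads versus what a browser shows after media-browser.js has run.

```javascript
// Returns true if any class attribute in the raw HTML contains the given class name.
// This is a rough string check for diagnosis, not a real HTML parser.
function rawHtmlHasClass(html, className) {
  var classAttr = /class\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  var match;
  while ((match = classAttr.exec(html)) !== null) {
    var names = (match[1] || match[2] || "").split(/\s+/);
    if (names.indexOf(className) !== -1) {
      return true;
    }
  }
  return false;
}

// Hypothetical stand-in for what request() downloads:
// an empty container plus a script tag that would fill it in a browser.
var serverHtml =
  '<div id="content-column"></div>' +
  '<script src="media-browser.js"></script>';

// Hypothetical stand-in for the DOM after the script has run in a browser.
var renderedHtml =
  '<div id="content-column">' +
  '<div class="mb-movie"><a href="/m/example"><h3>Example Movie</h3></a></div>' +
  '</div>';

console.log(rawHtmlHasClass(serverHtml, "mb-movie"));   // false
console.log(rawHtmlHasClass(renderedHtml, "mb-movie")); // true
```

In the original script, the same check could be run on the html argument of the request callback. If it returns false, no cheerio selector will find the movies, and the data has to come from a browser-driven approach or another source.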