Can't scrape all the data with cheerio - node.js


js noob here,

I'm trying to create a web scraper to scrape price data off booking websites, but I can't get the data I want, at least not every time.

I'm testing this specific url:

https://www.trivago.fr/?aDateRange%5Barr%5D=2019-10-09&aDateRange%5Bdep%5D=2019-10-10&aPriceRange%5Bfrom%5D=0&aPriceRange%5Bto%5D=0&iRoomType=7&aRooms%5B0%5D%5Badults%5D=2&cpt2=22748%2F200&iViewType=0&bIsSeoPage=0&sortingId=1&slideoutsPageItemId=&iGeoDistanceLimit=20000&address=&addressGeoCode=&offset=0&ra=

This is what I get in about 1 out of 20 attempts:

{ prices:
   [ 'Prix / nuit',
     'Hébergement',
     'Avis',
     'Emplacement',
     'Autres',
     'max. 500€+',
     'Bien',
     '108€',
     'Bien',
     '112€',
     'Excellent',
     '98€',
     'Très bien',
     '122€',
     'Très bien',
     '164€',
     'Excellent',
     '156€',
     'Très bien',
     '97€',
     'Très bien',
     '160€',
     'Très bien',
     '155€',
     ' ',
     '87€',
     'Excellent',
     '134€',
     'Très bien',
     '155€',
     ' ',
     '92€',
     'Excellent',
     '135€',
     'Très bien',
     '135€',
     'Excellent',
     '94€',
     ' ',
     '82€',
     'Très bien',
     '98€',
     'Excellent',
     '99€',
     'Bien',
     '110€',
     'Bien',
     '141€',
     ' ',
     '80€',
     'Très bien',
     '136€',
     'Excellent',
     '122€',
     'Excellent',
     '232€',
     '1',
     'trivago N.V.' ] }

and this is what I get most of the time:

{ prices:
   [ 'Prix / nuit',
     'Hébergement',
     'Avis',
     'Emplacement',
     'Autres',
     'max. 500€+',
     'trivago N.V.' ] }

I've been told it might have something to do with the speed at which the data is gathered: the script may finish running before all the data has been retrieved.

Code:

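const http = require('http');
const puppeteer = require('puppeteer');
let cheerio = require('cheerio');
let jsonframe = require('jsonframe-cheerio');

const port = 3000; // assumed value; the original snippet did not define it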

const server = http.createServer((req, res) => {
    res.statusCode = 200;
    res.setHeader('Content-Type', 'text/plain');
    res.end('Hello World\n');
});

server.listen(port);

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    let frame;
    
    await page.goto('https://www.trivago.fr/?aDateRange%5Barr%5D=2019-10-09&aDateRange%5Bdep%5D=2019-10-10&aPriceRange%5Bfrom%5D=0&aPriceRange%5Bto%5D=0&iRoomType=7&aRooms%5B0%5D%5Badults%5D=2&cpt2=22748%2F200&iViewType=0&bIsSeoPage=0&sortingId=1&slideoutsPageItemId=&iGeoDistanceLimit=20000&address=&addressGeoCode=&offset=0&ra=');
    let bodyHTML = await page.evaluate(() => document.body.innerHTML).then(frame = {"prices": ["strong"]});
    let $ = cheerio.load(bodyHTML);
    jsonframe($);
    var postsList = $('body').scrape(frame);
    console.log(postsList);
    await browser.close();
})();

Solution

    1. The website you're parsing, Trivago, loads its results via AJAX requests to https://cdn-hs-graphql-dus.trivago.com/graphql. The response is JSON, so you can parse it with a JSON parser if you don't want to use Puppeteer.
    2. If you don't want to inspect those requests (using Chrome DevTools), I suggest using Puppeteer. In Puppeteer, you can use the waitForSelector method: for example, to get hotel names and prices, wait for their selector to become available in the DOM, or simply wait a few seconds.
    3. If you want to extract data with jsonframe, you should also learn more about CSS selectors. I prefer the [itemtype=""] and [itemprop=""] attribute selectors, since they are reliable and fast to match.
      https://css-tricks.com/how-css-selectors-work/
    4. To display the data you can use console.log, but if you prefer to run Node.js as a server, I suggest using Express.
    5. To make your script run faster, you can block image requests with a request interceptor.
    6. jsonframe must be linked to cheerio by calling jsonframe($) before scrape(); check that this call is in your script.
    7. You can use the following code as an example.
    (async () => {
    
        const http = require('http');
        const puppeteer = require('puppeteer');
        const cheerio = require('cheerio');
        const jsonframe = require('jsonframe-cheerio');
    
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
    
        // Disable the navigation timeout and skip images/media to speed up loading.
        page.setDefaultNavigationTimeout(0);
        await page.setRequestInterception(true);
    
        page.on('request', request => {
            const type = request.resourceType();
            if (type === 'image' || type === 'media') {
                request.abort();
            } else {
                request.continue();
            }
        });
    
        await page.goto('https://www.trivago.fr/?aDateRange%5Barr%5D=2019-10-09&aDateRange%5Bdep%5D=2019-10-10&aPriceRange%5Bfrom%5D=0&aPriceRange%5Bto%5D=0&iRoomType=7&aRooms%5B0%5D%5Badults%5D=2&cpt2=22748%2F200&iViewType=0&bIsSeoPage=0&sortingId=1&slideoutsPageItemId=&iGeoDistanceLimit=20000&address=&addressGeoCode=&offset=0&ra=');
        // Wait until the AJAX-loaded hotel names are present in the DOM.
        await page.waitForSelector('h3[itemprop="name"]>span.item-link');
        const bodyHTML = await page.content();
        await browser.close();
    
        const $ = cheerio.load(bodyHTML);
    
        // Link jsonframe to this cheerio instance so that scrape() is available.
        jsonframe($);
    
        let frame = {
            hotels : {
                _s : "[itemtype='https://schema.org/Hotel']",
                _d : [{
                    "hotelname" : "h3[itemprop='name']>span.item-link",
                    "hotelprice": "meta[itemprop='price'] ~ em ~ strong"
                }]
            }
        };
    
        // { string: true } makes scrape() return the result as a JSON string.
        const displayResult = $('body').scrape(frame, { string: true });
    
        if (displayResult.length > 0) {
            const responseCode = 200;
            http.createServer(function (req, res) {
                res.writeHead(responseCode, {'Content-Type': 'application/json; charset=utf-8'});
                res.end(displayResult);
            }).listen(3000);
    
            console.log('Node server running on localhost:3000');
        } else {
            console.log('Node server not started because the results could not be parsed.');
        }
    
    })();
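
If you want to work with the scraped values rather than just serve the raw string, you can parse the JSON and clean up the prices. The following is only a rough sketch: it assumes the scrape result has the shape implied by the frame above (a hotels array of hotelname/hotelprice pairs), the sample hotel names are invented, and parsePrice and cheapestFirst are hypothetical helper names, not part of jsonframe or cheerio.

```javascript
// Hypothetical sample of the JSON string that scrape(frame, { string: true })
// might return for the frame above; the hotel names are invented.
const displayResult = JSON.stringify({
    hotels: [
        { hotelname: 'Hotel A', hotelprice: '108€' },
        { hotelname: 'Hotel B', hotelprice: '98€' },
        { hotelname: 'Hotel C', hotelprice: ' ' } // price missing on the page
    ]
});

// Turn a price label such as '108€' into a number, or null if it has no digits.
function parsePrice(label) {
    const digits = String(label).replace(/[^\d]/g, '');
    return digits === '' ? null : Number(digits);
}

// Parse the JSON string, drop hotels without a usable price, sort cheapest first.
function cheapestFirst(jsonString) {
    const { hotels = [] } = JSON.parse(jsonString);
    return hotels
        .map(h => ({ name: h.hotelname, price: parsePrice(h.hotelprice) }))
        .filter(h => h.price !== null)
        .sort((a, b) => a.price - b.price);
}

console.log(cheapestFirst(displayResult));
```

Filtering out entries without digits also guards against the blank ' ' values that showed up in the question's output.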