Tags: javascript, web-scraping, web-crawler, puppeteer, headless-browser

Get complete web page source HTML with puppeteer, but some part is always missing


I am trying to scrape a specific string from the webpage below:

https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;

The info I want to get from this page's source is the serial number in the strings below (I can find them when I right-click and choose "View Page source"):

 name="nr_rooms_4377601_232287150_0_1_0" / name="nr_rooms_4377601_232287150_1_1_0"

I am using puppeteer, and my code is below:

const puppeteer = require('puppeteer');
(async() => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //await page.goto('https://example.com');
    const response = await page.goto("My-url-above");
    let bodyHTML = await page.evaluate(() => document.body.innerHTML);
    let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
    console.log(await response.text());
    console.log(await page.content());
    await browser.close();
})()

But I cannot find the strings I am looking for in either response.text() or page.content().

Am I using the wrong methods on page?

How can I dump the actual page source, exactly as it appears when I right-click and choose "View Page source"?


Solution

  • If you look at where these strings appear, you can see that they are in <select> elements with a specific class (.hprt-nos-select):

    <select
      class="hprt-nos-select"
      name="nr_rooms_4377601_232287150_0_1_0"
      data-component="hotel/new-rooms-table/select-rooms"
      data-room-id="4377601"
      data-block-id="4377601_232287150_0_1_0"
      data-is-fflex-selected="0"
      id="hprt_nos_select_4377601_232287150_0_1_0"
      aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
    >
    

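    You would need to wait until this element is loaded into the DOM; after that it will also show up in the HTML returned by page.content(). Note that timeout: 0 disables the timeout, so the call waits indefinitely: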

    await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
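
Once the wait succeeds, the name attributes can be read straight from the DOM instead of searching the serialized HTML. The following is a minimal sketch under some assumptions: roomSerial and scrapeRoomSerials are hypothetical helper names, the URL is passed on the command line, and the default waitForSelector timeout is kept so that the script fails instead of hanging if the element never appears.

```javascript
// Hypothetical helper: strip the "nr_rooms_" prefix from a name such as
// "nr_rooms_4377601_232287150_0_1_0". Returns null for any other string.
function roomSerial(name) {
  const match = /^nr_rooms_(.+)$/.exec(name || '');
  return match ? match[1] : null;
}

// Hypothetical helper: open the page, wait for the room selects, collect their names.
async function scrapeRoomSerials(url) {
  const puppeteer = require('puppeteer'); // loaded lazily so roomSerial() works on its own
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Rejects after the default timeout if no matching element shows up
    await page.waitForSelector('.hprt-nos-select');
    // Runs in the browser context and returns the name attribute of every match
    const names = await page.$$eval('.hprt-nos-select', (els) =>
      els.map((el) => el.getAttribute('name'))
    );
    return names.map(roomSerial).filter(Boolean);
  } finally {
    await browser.close();
  }
}

// Only runs when a URL is given, e.g. node scrape.js "<url>"
if (process.argv[2]) {
  scrapeRoomSerials(process.argv[2]).then(console.log, console.error);
}
```

If waitForSelector rejects here, that is a hint that the room table was never rendered at all, which is the situation described next.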
    

    BUT your issue actually lies elsewhere: the URL you are visiting carries extra parameters (?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;) that do not seem to take effect when the page is loaded through puppeteer. If you take a full-page screenshot, you will see that the page still shows the default hotel search form, without the specific hotel offers you are expecting.

    You should interact with the search form through puppeteer (page.click() etc.) and set the dates and the origin country yourself to get the expected page content.
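
The full-page screenshot check suggested in the answer can be sketched roughly as follows. This is only an illustration: diagnose and main are hypothetical names, the output file name booking-page.png is an assumption, and the URL is again taken from the command line. The form interaction itself is left out because the selectors of the search form are not given in the answer.

```javascript
// Hypothetical helper: save a full-page screenshot and report whether
// a room select (.hprt-nos-select) is present in the DOM.
async function diagnose(page, screenshotPath) {
  await page.screenshot({ path: screenshotPath, fullPage: true });
  const select = await page.$('.hprt-nos-select'); // resolves to null when nothing matches
  return { screenshotPath, hasRoomSelect: select !== null };
}

async function main(url) {
  const puppeteer = require('puppeteer'); // loaded lazily so diagnose() can be tested with a stub
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    console.log(await diagnose(page, 'booking-page.png'));
  } finally {
    await browser.close();
  }
}

// Only runs when a URL is given, e.g. node diagnose.js "<url>"
if (process.argv[2]) {
  main(process.argv[2]).catch(console.error);
}
```

If hasRoomSelect comes back false and the screenshot shows only the search form, the offers table was not rendered, and the dates would have to be set through the form before the select elements can be scraped.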