web-scraping, ascii, chromium, headless, google-chrome-headless

Headless Chrome: website Div content to Text, toString or ASCII


I want to scrape text from a dynamically loaded website. Because the content is loaded dynamically, static options such as $ lynx --dump google.com do not work. So far I have used headless Chrome like this:

$ Chrome --headless --disable-gpu --no-sandbox --run-all-compositor-stages-before-draw --virtual-time-budget=1000 --window-size=1200,3000 --screenshot http://mtv.com

but I cannot find an option to extract the text of the page. I am open to any dynamic scraping approach that gets the text, for instance, of a specific div with a given class.

How can I scrape text from a dynamically loaded website?

Example result of the dynamic loading using headless Chrome: [screenshot]


Solution

  • If you can write JavaScript for Node.js, you can try puppeteer, a Node.js library for controlling headless Chrome:

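    'use strict';
    
    const puppeteer = require('puppeteer');
    
    (async function main() {
      let browser;
      try {
        browser = await puppeteer.launch({ headless: true });
        const [page] = await browser.pages();
    
        // Wait until network activity settles so dynamic content has loaded.
        await page.goto('http://www.mtv.com/', { waitUntil: 'networkidle2' });
    
        // Runs inside the page: read the rendered text of the target div.
        const data = await page.evaluate(() => {
          return document.querySelector('div.header').innerText;
        });
    
        console.log(data);
      } catch (err) {
        console.error(err);
      } finally {
        // Close the browser even if navigation or evaluation failed.
        if (browser) await browser.close();
      }
    })();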
    
    

    Output:

    teen mom 2
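
The question also asks about the text of a specific div with a given class, which may not be in the DOM yet when the page first loads. Below is a minimal sketch of a reusable helper for that case. It assumes `page` is a Puppeteer `Page`; `extractText` and `timeoutMs` are hypothetical names used only for illustration, and the helper relies only on Puppeteer's `page.waitForSelector` and `page.$$eval`.

```javascript
'use strict';

/**
 * Collect the visible text of every element matching `selector`.
 * `page` is assumed to be a Puppeteer Page (or any object exposing
 * waitForSelector and $$eval with the same shape).
 * `extractText` and `timeoutMs` are hypothetical names for this sketch.
 */
async function extractText(page, selector, timeoutMs = 10000) {
  // Wait for dynamically injected content to appear in the DOM.
  await page.waitForSelector(selector, { timeout: timeoutMs });

  // The callback runs in the browser context: read innerText of each match.
  const texts = await page.$$eval(selector, (elements) =>
    elements.map((el) => el.innerText.trim())
  );

  // Drop matches that have no visible text.
  return texts.filter((text) => text.length > 0);
}
```

With the `page` object from the script above, `await extractText(page, 'div.header')` should resolve to an array with one string per matching div. If the selector never appears, `waitForSelector` rejects after the timeout, so the call is best kept inside the existing `try` block.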