I want to scrape text from a website whose content is loaded dynamically. Because of that, static options such as
$ lynx --dump google.com
do not seem to work. So I tried headless Chrome:
$ Chrome --headless --disable-gpu --no-sandbox --run-all-compositor-stages-before-draw --virtual-time-budget=1000 --window-size=1200,3000 --screenshot http://mtv.com
but I cannot find an option to extract the text of the page. I am open to any dynamic scraping approach that can get, for instance, the text of a specific div with a given class.
How can I scrape text from a dynamically-loaded website?
If you can write JavaScript for Node.js, you can try puppeteer, a Node.js library for controlling headless Chrome:
'use strict';

const puppeteer = require('puppeteer');

(async function main() {
  try {
    // Start headless Chrome and reuse the tab it opens by default.
    const browser = await puppeteer.launch({ headless: true });
    const [page] = await browser.pages();

    await page.goto('http://www.mtv.com/');

    // This callback runs inside the page, so it sees the rendered DOM.
    const data = await page.evaluate(() => {
      return document.querySelector('div.header').innerText;
    });

    console.log(data);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
Output:
teen mom 2
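
If the div you need is inserted some time after the initial page load, it may help to wait for it explicitly before reading its text. Below is a minimal sketch of that idea, not a definitive implementation: `scrapeText` and `cleanText` are hypothetical helper names, the 10-second timeout and the `div.header` fallback selector are assumptions, and the script name `scrape.js` is only illustrative. It relies on puppeteer's `waitUntil: 'networkidle2'` option, `page.waitForSelector` and `page.$$eval`, and returns the text of every matching element rather than just the first one.

```javascript
'use strict';

// Hypothetical helper: collapse runs of whitespace so each text fits on one line.
function cleanText(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: load `url`, wait for `selector` to appear,
// and return the cleaned text of every matching element.
async function scrapeText(url, selector) {
  // Required lazily so cleanText stays usable without puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // 'networkidle2' waits until at most two network connections
    // have been open for at least 500 ms.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Wait for dynamically inserted content (assumed 10 s timeout).
    await page.waitForSelector(selector, { timeout: 10000 });
    // The callback runs inside the page against all matching elements.
    const texts = await page.$$eval(selector, els => els.map(el => el.innerText));
    return texts.map(cleanText).filter(Boolean);
  } finally {
    // Close the browser even if navigation or the wait fails.
    await browser.close();
  }
}

// Runs only when a URL is passed on the command line.
if (process.argv[2]) {
  scrapeText(process.argv[2], process.argv[3] || 'div.header')
    .then(texts => texts.forEach(t => console.log(t)))
    .catch(err => {
      console.error(err);
      process.exitCode = 1;
    });
}
```

You would run it as, for example, `node scrape.js http://www.mtv.com/ div.header`. Closing the browser in `finally` keeps a failed navigation or a timeout from leaving a Chrome process behind.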