I have spent the last two days perfecting a list of URLs from a site that I want to crawl. My script (which is basically identical to the example for CheerioCrawler except for the data extraction) is working, but there is a problem: some of the documents saved locally in the data store are incomplete. In terms of the example script, the title field is blank in some of the saved records, while in others everything is saved. The only field that gets saved every time is url: request.url.
My best guess is that the domain I'm crawling is very slow, with multiple scripts loaded from other domains, and Cheerio is just blasting through and not waiting for the whole page to be fully loaded before it extracts whatever data it can find, and moving on.
The total number of pages to crawl is about 2500, so I don't mind if the process is slow, but I'd like to make sure it's complete.
How can I ensure the page is fully loaded before it's extracted? I thought that the async function would do that automatically.
The likely problem is that the webpage loads some of its content using asynchronous XHR calls made with JavaScript. CheerioCrawler only gets the HTML returned by the first request for each URL and does not execute JavaScript, so anything loaded afterwards is missing. If you want the asynchronously loaded content, you need to open the page in a browser.
You can do that simply by using PuppeteerCrawler. Its interface is quite similar to CheerioCrawler's. It opens the webpage in a browser for each request, and you can use the various waitFor functions from the Puppeteer page interface (for example, waitForSelector) to wait for the content you want to extract.
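A minimal sketch of what that could look like, assuming an Apify SDK version that exposes PuppeteerCrawler with a handlePageFunction option (newer releases may name things differently). The selector, the heading field, the 'pages' list name and the helper names handlePage and runCrawl are all hypothetical, so swap in whatever marks "the content has arrived" on your site:

```javascript
// Sketch only: SELECTOR, the field names and both function names are
// illustrative assumptions, not part of the original script.
const SELECTOR = 'h1'; // hypothetical: an element that appears only after the XHR content has rendered

// Runs once per URL with a live Puppeteer page.
async function handlePage({ request, page }) {
    // Wait until the element exists in the DOM. This throws on timeout,
    // so the crawler can retry the request instead of saving blank fields.
    await page.waitForSelector(SELECTOR, { timeout: 60000 });

    return {
        url: request.url,
        title: await page.title(),
        heading: await page.$eval(SELECTOR, (el) => el.textContent.trim()),
    };
}

// Wires the handler into a crawler. `urls` is an array of URL strings.
async function runCrawl(urls) {
    const Apify = require('apify'); // loaded lazily so handlePage can be exercised on its own
    const requestList = await Apify.openRequestList('pages', urls);
    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        maxConcurrency: 2, // the target site is slow, so keep the load low
        handlePageFunction: async (context) => {
            await Apify.pushData(await handlePage(context));
        },
    });
    await crawler.run();
}
```

To run it, call something like require('apify').main(() => runCrawl(['https://example.com/page-1'])) with your own list of URLs. Waiting on a specific selector is usually more reliable than waiting a fixed number of seconds, because a slow page just takes longer and a page that never loads the content fails loudly rather than being stored half empty.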