
Crawling urls from multiple sitemap.xml files


I'm building an Apify actor for a site where all the needed URLs are stored in different sitemap.xml files. The file names are static, but I can't figure out how to add several sitemap.xml files to the actor.

Below is the working code with one XML file. It needs to somehow loop over multiple URLs, but as there are about 600 of them, it would preferably be something like extracting all sitemap URLs from a CSV, crawling each sitemap for URLs, and then crawling each of those URLs.

const Apify = require('apify');
const cheerio = require('cheerio');
const requestPromised = require('request-promise-native');

Apify.main(async () => {

    const xml = await requestPromised({
        url: 'https://www.website.com/sitemap1.xml', // <- This part needs to accept input of about 600 sitemap.xml urls in total

        headers: {
        'User-Agent': 'curl/7.54.0'
        }
     });

    // Parse sitemap and create RequestList from it
    const $ = cheerio.load(xml);
    const sources = [];
    $('loc').each(function (val) {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Otherwise the target doesn't allow to download the page!
                'User-Agent': 'curl/7.54.0',
            }
        });
    });

    const requestList = new Apify.RequestList({
        sources,
    });
    await requestList.initialize();

    // Crawl each page from sitemap
    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {

            await Apify.pushData({
                url: request.url
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});

Each sitemap.xml has a static link/name, but their content changes daily, and the total number of URLs in the sitemaps is 60,000-70,000; it's those URLs that I ultimately need to fetch :-)
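
For the CSV part, something like this minimal sketch (assuming a local file named sitemaps.csv with one URL per line; both the file name and format are just examples) is what I have in mind for reading the sitemap list:

const fs = require('fs');

// Read one sitemap URL per line from a local CSV (hypothetical file name/format)
const sitemapUrls = fs.readFileSync('./sitemaps.csv', 'utf8')
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);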


Solution

  • There are of course many ways to handle this problem, but the most reliable way is to use the power of the Apify crawler classes.

    The simplest solution would be to use a single CheerioCrawler and have separate logic in handlePageFunction for the sitemap URLs and the final URLs. Unfortunately, CheerioCrawler is not able to parse XML (this will probably be fixed in the near future), so we will have to use two crawlers.

    For the first part, the XML parsing, we will use BasicCrawler. It is the most generic of Apify's crawlers, so it can easily reuse the code you already have. We will push the extracted URLs to a request queue and handle them in the second crawler, which can stay mostly as it is.

    const Apify = require('apify');
    const cheerio = require('cheerio');
    const requestPromised = require('request-promise-native');
    
    Apify.main(async () => {
    
        // Here we will push the URLs found in the sitemaps
        const requestQueue = await Apify.openRequestQueue();
    
        // This would be better passed via INPUT as `const xmlUrls = await Apify.getInput().then((input) => input.xmlUrls)`
        const xmlUrls = [
            'https://www.website.com/sitemap1.xml',
            // ...
        ];
    
        const xmlRequestList = new Apify.RequestList({
            sources: xmlUrls.map((url) => ({ url })) // We make simple request objects from the URLs
        });
    
        await xmlRequestList.initialize();
    
        const xmlCrawler = new Apify.BasicCrawler({
            requestList: xmlRequestList,
            handleRequestFunction: async ({ request }) => {
                // This is basically the same code you have, we just have to push the sources to the queue
                const xml = await requestPromised({
                    url: request.url,
                    headers: {
                        'User-Agent': 'curl/7.54.0'
                    }
                });
    
                const $ = cheerio.load(xml);
                const sources = [];
                $('loc').each(function (val) {
                    const url = $(this).text().trim();
                    sources.push({
                        url,
                        headers: {
                            // NOTE: Otherwise the target doesn't allow to download the page!
                            'User-Agent': 'curl/7.54.0',
                        }
                    });
                });
                for (const finalRequest of sources) {
                    await requestQueue.addRequest(finalRequest);
                }
            }
        });
    
        await xmlCrawler.run();
    
        // Crawl each page from sitemap
        const crawler = new Apify.CheerioCrawler({
            requestQueue,
            handlePageFunction: async ({ $, request }) => {
                // Add your logic for final URLs
                await Apify.pushData({
                    url: request.url
                });
            },
        });
    
        await crawler.run();
        console.log('Done.');
    });
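
    As the comment in the code suggests, the xmlUrls array is better passed in via the actor INPUT than hardcoded. A minimal sketch of just that part, assuming the INPUT is a JSON object with an xmlUrls array, could look like this:

    const Apify = require('apify');

    Apify.main(async () => {
        // Assumed INPUT shape: { "xmlUrls": ["https://www.website.com/sitemap1.xml", ...] }
        const input = await Apify.getInput();
        const xmlUrls = input.xmlUrls || [];

        // ...build xmlRequestList from xmlUrls and continue exactly as in the code above
    });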