node.js, xml, web-scraping

Getting the link to a specific .xml file from a website's sitemap.xml


I have a website's sitemap.xml with the structure below:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://www.example.com/sitemap/Main-8531739688368880386.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://www.example.com/sitemap/Product-8073944469920756310.xml</loc>
    </sitemap>
</sitemapindex>

I only get the sitemap.xml above after loading www.example.com/sitemap.xml. The entries in the sitemap index keep changing, so I want to grab the <loc> tags once www.example.com/sitemap.xml has loaded and then work with the Product-8073944469920756310.xml link.

Any solutions?


Solution

  • If I understand you correctly, you can grab it using xpath. For example:

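    const xpath = require('xpath');
    // newer releases of xmldom are published as @xmldom/xmldom
    const { DOMParser } = require('xmldom');

    const xml = `your xml above`; // the sitemap index text goes here
    const doc = new DOMParser().parseFromString(xml, 'text/xml');
    // local-name() matches <loc> regardless of the sitemap default namespace
    const nodes = xpath.select("//*[local-name()='loc']/text()", doc);
    // nodes[1] is the second <loc>, which is the Product sitemap in this index
    console.log(nodes[1].data);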
    

    Output:

    https://www.example.com/sitemap/Product-8073944469920756310.xml
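    Since the entries keep changing, relying on a fixed position such as nodes[1] may break if the order shifts. Below is a rough, dependency-free sketch that downloads the index and picks the entry by file-name prefix instead. It swaps the XPath query for a simple regex, which is only reasonable because <loc> holds plain text. It assumes Node.js 18+ (for the global fetch); the function names extractLocs, findSitemap and getSitemapUrl are made up for illustration.

```javascript
// Sketch: list every <loc> in a sitemap index and pick one by file-name prefix.
// Assumes Node.js 18+ (global fetch) and that <loc> values contain no nested markup.

// Return all <loc> URLs found in the XML text, in document order.
function extractLocs(xml) {
  const locs = [];
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  for (const match of xml.matchAll(pattern)) {
    // URLs in a sitemap are entity-escaped, so undo the common &amp; case.
    locs.push(match[1].replace(/&amp;/g, '&'));
  }
  return locs;
}

// Return the first URL whose file name starts with the prefix, or undefined.
function findSitemap(locs, prefix) {
  return locs.find((url) => url.split('/').pop().startsWith(prefix));
}

// Hypothetical helper: download the index and return the matching child sitemap URL.
async function getSitemapUrl(indexUrl, prefix) {
  const response = await fetch(indexUrl);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return findSitemap(extractLocs(await response.text()), prefix);
}

// Usage (performs a network request):
// getSitemapUrl('https://www.example.com/sitemap.xml', 'Product-').then(console.log);
```

    Matching on the "Product-" prefix keeps working when the numeric suffix changes. If the index ever carries more complex markup, the XPath approach above is the safer parser.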