Parse XML Sitemap with Python


I have a sitemap at http://www.site.co.uk/sitemap.xml, which is structured like this:

<sitemapindex>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
...

I want to extract data from it. First I need to count how many <sitemap> elements the XML contains, and then, for each one, extract its <loc> and <lastmod> values. Is there an easy way to do this in Python?

I've seen similar questions, but they all extract, for example, every <loc> element in the XML. I need to extract the data from each <sitemap> element individually.

I've tried to use lxml with this code:

import urllib2
from lxml import etree

u = urllib2.urlopen('http://www.site.co.uk/sitemap.xml')
doc = etree.parse(u)

element_list = doc.findall('sitemap')

for element in element_list:
    url = element.findtext('loc')
    print url

but element_list is empty.


Solution

  • I chose to use the Requests and BeautifulSoup libraries. I built a dictionary in which each key is a URL and each value is its last-modified date.

    from bs4 import BeautifulSoup
    import requests
    
    xml_dict = {}
    
    r = requests.get("http://www.site.co.uk/sitemap.xml")
    xml = r.text
    
    soup = BeautifulSoup(xml, "lxml")
    sitemap_tags = soup.find_all("sitemap")
    
    print(f"The number of sitemaps is {len(sitemap_tags)}")
    
    for sitemap in sitemap_tags:
        # find() searches only inside this <sitemap>; strip() drops whitespace around the text
        xml_dict[sitemap.find("loc").text.strip()] = sitemap.find("lastmod").text.strip()
    
    print(xml_dict)
    

    Or with lxml:

    from lxml import etree
    import requests
    
    xml_dict = {}
    
    r = requests.get("http://www.site.co.uk/sitemap.xml")
    root = etree.fromstring(r.content)
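    print(f"The number of sitemap tags is {len(root)}")
    for sitemap in root:
        # Assumes <loc> is the first child and <lastmod> the second
        children = list(sitemap)  # getchildren() is deprecated in lxml
        xml_dict[children[0].text.strip()] = children[1].text.strip()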
    print(xml_dict)
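
A likely reason findall('sitemap') returned an empty list in the question is that sitemap files normally declare the default namespace http://www.sitemaps.org/schemas/sitemap/0.9 on the root element, so a bare tag name does not match. The sample in the question omits the root attributes, so this is an assumption. Below is a minimal sketch using only the standard library that looks elements up by name regardless of namespace. The SAMPLE document and the parse_sitemap_index helper are hypothetical names made up for illustration, and the "{*}" wildcard needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real sitemap index. The xmlns declaration is
# an assumption based on the sitemaps.org protocol, not copied from the site.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
</sitemapindex>
"""


def parse_sitemap_index(xml_bytes):
    """Return a list of (loc, lastmod) pairs, one per <sitemap> element."""
    root = ET.fromstring(xml_bytes)
    entries = []
    # "{*}name" matches the tag in any namespace, or in none (Python 3.8+).
    for sitemap in root.findall("{*}sitemap"):
        # findtext() returns None when the child is missing, hence the "or".
        loc = (sitemap.findtext("{*}loc") or "").strip()
        lastmod = (sitemap.findtext("{*}lastmod") or "").strip()
        entries.append((loc, lastmod))
    return entries


entries = parse_sitemap_index(SAMPLE)
print(f"The number of sitemaps is {len(entries)}")
for loc, lastmod in entries:
    print(loc, lastmod)
```

To run this against the live file, pass r.content from the requests.get call above instead of SAMPLE. Looking children up by name rather than by position also keeps the result correct if <lastmod> is missing or the children appear in a different order.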