I have a sitemap like this: http://www.site.co.uk/sitemap.xml which is structured like this:
<sitemapindex>
  <sitemap>
    <loc>http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml</loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml</loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
...
And I want to extract data from it. First of all I need to count how many <sitemap> elements are in the XML, and then, for each of them, extract the <loc> and <lastmod> data. Is there an easy way to do this in Python?
I've seen other questions like this, but all of them extract, for example, every <loc> element in the XML; I need to extract the data individually from each element.
I've tried to use lxml
with this code:
import urllib2
from lxml import etree

u = urllib2.urlopen('http://www.site.co.uk/sitemap.xml')
doc = etree.parse(u)
element_list = doc.findall('sitemap')
for element in element_list:
    url = element.findtext('loc')
    print url
but element_list
is empty.
I chose to use the Requests and BeautifulSoup libraries. I created a dictionary where the key is the URL and the value is the last modified date.
from bs4 import BeautifulSoup
import requests

xml_dict = {}
r = requests.get("http://www.site.co.uk/sitemap.xml")
xml = r.text
soup = BeautifulSoup(xml, "lxml")
sitemap_tags = soup.find_all("sitemap")
print(f"The number of sitemaps are {len(sitemap_tags)}")
for sitemap in sitemap_tags:
    xml_dict[sitemap.find_next("loc").text] = sitemap.find_next("lastmod").text
print(xml_dict)
Or with lxml:
from lxml import etree
import requests

xml_dict = {}
r = requests.get("http://www.site.co.uk/sitemap.xml")
root = etree.fromstring(r.content)
print(f"The number of sitemap tags are {len(root)}")
for sitemap in root:
    children = list(sitemap)  # getchildren() is deprecated in recent lxml
    xml_dict[children[0].text] = children[1].text
print(xml_dict)
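As a side note, the reason your original `doc.findall('sitemap')` came back empty is most likely that sitemap files declare the default namespace `http://www.sitemaps.org/schemas/sitemap/0.9`, so the elements are not named plain `sitemap` as far as lxml is concerned. If your feed declares that namespace, a namespace-aware lookup works; here's a minimal sketch parsing an inline copy of the index so it runs without network access:

```python
from lxml import etree

# Inline sample of the sitemap index, with the standard sitemap namespace
# declared (assumed to match what the real http://www.site.co.uk/sitemap.xml
# serves).
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml</loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml</loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
</sitemapindex>"""

root = etree.fromstring(xml)
# Map a prefix to the default namespace so findall/findtext can qualify names.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

xml_dict = {}
for sitemap in root.findall("sm:sitemap", ns):
    loc = sitemap.findtext("sm:loc", namespaces=ns)
    lastmod = sitemap.findtext("sm:lastmod", namespaces=ns)
    xml_dict[loc] = lastmod

print(f"The number of sitemap tags are {len(xml_dict)}")
print(xml_dict)
```

With a live feed you would replace the `xml` literal with `requests.get(...).content` as above; the namespace mapping is the only part that differs from your attempt.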