I want to parse a compressed sitemap like www.example.com/sitemap.xml.gz and collect all the URLs in the sitemap, without saving sitemap.xml.gz to disk.

There are ways to parse it after downloading sitemap.xml.gz and decompressing it, with the help of lxml, BeautifulSoup, etc.
import subprocess

import requests
import lxml.html


def parse_sitemap_gz(url):
    r = requests.get(url, stream=True)
    if r.status_code != 200:
        return False
    file_name = url.split('/')[-1]
    # download the sitemap file
    with open(file_name, 'wb') as f:
        if not r.ok:
            print 'error in %s' % (url)
        for block in r.iter_content(1024):
            if not block:
                break
            f.write(block)  # can I parse it without writing to a file?
        f.flush()
    # decompress the gz file
    subprocess.call(['gunzip', '-f', file_name])
    # parse the xml file
    page = lxml.html.parse(file_name[0:-3])
    all_urls = page.xpath('//url/loc/text()')
    #print all_urls
    # delete the decompressed sitemap file now
    subprocess.call(['rm', '-rf', file_name[0:-3]])
    return all_urls
In this code I am writing the compressed sitemap to a file, but my intention is not to write anything to a file at all.

To learn and to build a smarter version of the code above, how can I parse it by decompressing the gzip stream in memory, so that I won't need to write the file to disk at any point?
If the only requirement is not to write to disk, the gzipped file doesn't use any extensions that only the gunzip utility supports, and it fits into memory, then you can start with:
import requests
import gzip
from StringIO import StringIO

r = requests.get('http://example.com/sitemap.xml.gz')
# wrap the downloaded bytes in an in-memory file and let gzip decompress them,
# so nothing is ever written to disk
sitemap = gzip.GzipFile(fileobj=StringIO(r.content)).read()
Then parse sitemap through lxml as you are doing now...
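As a minimal sketch of that step, assuming a standard sitemap that declares the sitemaps.org namespace (lxml.etree is used here rather than lxml.html, so the xpath has to name that namespace explicitly):

import lxml.etree

root = lxml.etree.fromstring(sitemap)
# sitemap files declare the http://www.sitemaps.org/schemas/sitemap/0.9
# namespace, so the xpath has to reference it
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
all_urls = root.xpath('//sm:url/sm:loc/text()', namespaces=ns)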
Note that it doesn't "chunk" the iterator, since you might as well just fetch the whole file in a single request anyway.
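If you ever did want to chunk it, say because the sitemap is too large to buffer comfortably, one way to sketch a streaming version (under the same assumptions, with a placeholder URL and chunk size) is to pair zlib's decompressobj, where 16 + MAX_WBITS selects the gzip format, with lxml's feed-parser interface:

import zlib

import requests
import lxml.etree

r = requests.get('http://example.com/sitemap.xml.gz', stream=True)

# 16 + MAX_WBITS tells zlib to expect gzip framing rather than a raw zlib stream;
# this assumes the server serves the .gz as an ordinary file, not via Content-Encoding
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
parser = lxml.etree.XMLParser()

# decompress each downloaded chunk and feed it straight into lxml,
# so the compressed file is never held in memory (or written to disk) in full
for chunk in r.iter_content(8192):
    parser.feed(decomp.decompress(chunk))
tail = decomp.flush()
if tail:
    parser.feed(tail)
root = parser.close()

ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
all_urls = root.xpath('//sm:url/sm:loc/text()', namespaces=ns)

The parsed tree itself still ends up in memory once parser.close() returns, but neither the compressed download nor a temporary file is ever stored whole.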