
How to parse compressed sitemap using python without downloading it to disk?


I want to parse a compressed sitemap, such as www.example.com/sitemap.xml.gz, and collect all the URLs in the sitemap without downloading sitemap.xml.gz to disk.

There are ways to parse it after downloading sitemap.xml.gz and decompressing it, with the help of lxml or BeautifulSoup etc.

import subprocess

import lxml.html
import requests


def parse_sitemap_gz(url):
    r = requests.get(url, stream=True)
    if r.status_code != 200:
        return False
    file_name = url.split('/')[-1]

    # download the sitemap file
    with open(file_name, 'wb') as f:
        if not r.ok:
            print('error in %s' % url)
        for block in r.iter_content(1024):
            if not block:
                break
            f.write(block)  # can I parse it without writing to a file?
            f.flush()

    # decompress the .gz file
    subprocess.call(['gunzip', '-f', file_name])

    # parse the xml file
    page = lxml.html.parse(file_name[:-3])
    all_urls = page.xpath('//url/loc/text()')
    # print(all_urls)

    # delete the sitemap file now
    subprocess.call(['rm', '-rf', file_name[:-3]])
    return all_urls

In this code I am writing the compressed sitemap to a file, but my intention is not to write anything to a file.
For learning, and to create a more intelligent version of the code above, how can I parse the sitemap by decompressing the gzip stream directly, so that I don't need to download the file or write it to disk?


Solution

  • If the only requirement is not to write to disk, and the gzipped file has no extensions that only the gunzip utility supports and fits into memory, then you can start with:

    import gzip
    from io import BytesIO

    import requests

    r = requests.get('http://example.com/sitemap.xml.gz')
    sitemap = gzip.GzipFile(fileobj=BytesIO(r.content)).read()


    Then parse sitemap with lxml, as you are already doing.
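
    For completeness, a minimal sketch of that parsing step, assuming the sitemap declares the standard sitemap namespace (http://www.sitemaps.org/schemas/sitemap/0.9). Note that lxml.etree, unlike the lxml.html parser in your snippet, is namespace-aware, so the XPath has to bind a prefix for it:

    import gzip
    from io import BytesIO

    import requests
    from lxml import etree

    r = requests.get('http://example.com/sitemap.xml.gz')
    sitemap = gzip.GzipFile(fileobj=BytesIO(r.content)).read()

    # sitemaps declare a default namespace, so bind a prefix for XPath
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    root = etree.fromstring(sitemap)
    all_urls = root.xpath('//sm:url/sm:loc/text()', namespaces=ns)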

    Note that this doesn't "chunk" the download, since you might as well fetch the whole file in a single request anyway.
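
    If you do want to decompress the stream chunk by chunk without holding the whole compressed body in memory, here is a minimal sketch using zlib's streaming decompressor (the wbits value 16 + zlib.MAX_WBITS tells zlib to expect a gzip header); the 1024-byte chunk size is arbitrary:

    import zlib

    import requests

    r = requests.get('http://example.com/sitemap.xml.gz', stream=True)
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip-wrapped deflate stream

    xml_parts = []
    for chunk in r.iter_content(1024):
        xml_parts.append(d.decompress(chunk))
    xml_parts.append(d.flush())  # drain any buffered output

    sitemap = b''.join(xml_parts)  # decompressed XML, never written to disk

    This only saves memory on the compressed side here, since the XML is still joined into one bytes object before parsing, but the same loop could feed an incremental parser instead.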