Download and decompress gzipped file in memory?


I would like to download a file using urllib and decompress the file in memory before saving.

This is what I have right now:

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
outfile = open(outFilePath, 'w')
outfile.write(decompressedFile.read())

This ends up writing empty files. How can I achieve what I'm after?

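Updated Answer:

Passing the downloaded data directly to the StringIO constructor leaves the buffer's position at the start, so no seek() call is needed: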

#! /usr/bin/env python2
import urllib2
import StringIO
import gzip

baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
# Check the filename first: it changes over time as new releases are published.
filename = "man-pages-5.00.tar.gz"
outFilePath = filename[:-3]

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO(response.read())
decompressedFile = gzip.GzipFile(fileobj=compressedFile)

# Open in binary mode: the decompressed tar archive is binary data.
with open(outFilePath, 'wb') as outfile:
    outfile.write(decompressedFile.read())
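
The script above targets Python 2. On Python 3, urllib2 and the StringIO module no longer exist, so a rough equivalent might look like the sketch below. It assumes urllib.request and io.BytesIO as the replacements; the function names gunzip_bytes and download_and_gunzip are made up for illustration and are not part of the original answer.

```python
import gzip
import io
import urllib.request


def gunzip_bytes(compressed):
    """Decompress gzip data held in memory and return the raw bytes."""
    # BytesIO(initial_bytes) starts with its position at 0,
    # so no seek() is needed before handing it to GzipFile.
    buffer = io.BytesIO(compressed)
    with gzip.GzipFile(fileobj=buffer, mode="rb") as gz:
        return gz.read()


def download_and_gunzip(url, out_path):
    """Fetch url, decompress the body in memory, and write it to out_path."""
    with urllib.request.urlopen(url) as response:
        compressed = response.read()
    # Binary mode, because the decompressed payload is arbitrary bytes.
    with open(out_path, "wb") as outfile:
        outfile.write(gunzip_bytes(compressed))
```

With the variables from the script above, this would be called as download_and_gunzip(baseURL + filename, outFilePath). Note that both the compressed and the decompressed data are held in memory at once, which matches the question but may not suit very large files.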

Solution

  • You need to seek to the beginning of compressedFile after writing to it and before passing it to gzip.GzipFile(). After the write, the file position sits at the end of the buffer, so the gzip module starts reading from there, finds no data, and treats the buffer as an empty file. See below:

    #! /usr/bin/env python
    import urllib2
    import StringIO
    import gzip
    
    baseURL = "https://www.kernel.org/pub/linux/docs/man-pages/"
    filename = "man-pages-3.34.tar.gz"
    outFilePath = "man-pages-3.34.tar"
    
    response = urllib2.urlopen(baseURL + filename)
    compressedFile = StringIO.StringIO()
    compressedFile.write(response.read())
    #
    # Set the file's current position to the beginning
    # of the file so that gzip.GzipFile can read
    # its contents from the top.
    #
    compressedFile.seek(0)
    
    decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
    
    # Open in binary mode: the decompressed tar archive is binary data.
    with open(outFilePath, 'wb') as outfile:
        outfile.write(decompressedFile.read())
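
The effect of the missing seek can be reproduced without any download. The following is a small sketch, assuming Python 3 with io.BytesIO standing in for StringIO.StringIO and a made-up payload in place of the real archive. It decompresses the same buffer twice, once with the position left at the end and once after rewinding.

```python
import gzip
import io

# Made-up sample data standing in for the downloaded archive.
payload = b"example data\n" * 3
compressed = gzip.compress(payload)

buffer = io.BytesIO()
buffer.write(compressed)

# After write(), the position sits at the end of the buffer.
position_after_write = buffer.tell()

# Reading from the end: gzip finds no bytes and yields nothing.
read_without_seek = gzip.GzipFile(fileobj=buffer, mode="rb").read()

# Rewind to the start, then decompress again.
buffer.seek(0)
read_after_seek = gzip.GzipFile(fileobj=buffer, mode="rb").read()

print("position after write:", position_after_write, "of", len(compressed))
print("without seek:", len(read_without_seek), "bytes")
print("after seek:", len(read_after_seek), "bytes")
```

The first read comes back empty, which is what produced the empty output files in the question; the second returns the full payload.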