Tags: python, urllib, os.path, tarfile

tarfile can't open tgz


I am trying to download a tgz file from this website: https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07

Here is my script:

import os
from six.moves import urllib
import tarfile

spam_path=os.path.join('ML', 'spam')
root_download='https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07'
spam_url=root_download+'255 MB Corpus (trec07p.tgz)'

if not os.path.isdir(spam_path):
    os.makedirs(spam_path)

path=os.path.join(spam_path, 'trec07p.tgz')
if not os.path.isfile('trec07p.tgz'):
    urllib.request.urlretrieve(spam_url,path)
tar_file=tarfile.open(path)

I am not sure what I am missing, but the following error is returned:

---------------------------------------------------------------------------
ReadError                                 Traceback (most recent call last)
<ipython-input-21-5644813e0670> in <module>()
     18 if not os.path.isfile('trec07p.tgz'):
     19     urllib.request.urlretrieve(spam_url,path)
---> 20 tar_file=tarfile.open(path)
     21 # tar_file.extractall(path)
     22 # tar_file.close()

/anaconda/lib/python2.7/tarfile.pyc in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1678                         fileobj.seek(saved_pos)
   1679                     continue
-> 1680             raise ReadError("file could not be opened successfully")
   1681 
   1682         elif ":" in mode:

ReadError: file could not be opened successfully

Thank you in advance for your help!


Solution

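  • You can pass a mode to tarfile.open as its second argument. For a gzip-compressed archive, use 'r:gz'. Note that the default mode 'r' already detects compression automatically, so a ReadError here usually means the downloaded file is not a tar archive at all. The URL built in the question appends the link text ('255 MB Corpus (trec07p.tgz)') to the page address instead of pointing at trec07p.tgz itself.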

    tarfile.open(path, 'r:gz')
    

    Working example (run after accepting the agreement on the download page):

    import tarfile

    import requests

    URL = 'https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/trec07p.tgz'
    FILE = '/home/blake/Downloads/trec07p.tgz'

    # Stream the archive to disk in 4 KiB chunks instead of loading it all into memory.
    resp = requests.get(URL, stream=True)
    resp.raise_for_status()

    with open(FILE, 'wb') as out_file:
        for chunk in resp.iter_content(chunk_size=1024 * 4):
            out_file.write(chunk)

    # Open as a gzip-compressed tar and list the member names.
    with tarfile.open(FILE, 'r:gz') as f:
        print(f.getnames())
    

    Output:

    ['trec07p/data/inmail.35059',
     'trec07p/data/inmail.34430',
     'trec07p/data/inmail.45722',
     ..
     ..]
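
If the ReadError persists, it may help to check what was actually saved to disk before calling tarfile.open. The following is only a rough sketch: describe_download is a hypothetical helper name, and the two sample files are created in a temporary directory purely for illustration (the second one stands in for an HTML page saved under a .tgz name, which is an assumption about what a wrong URL would return). It relies on tarfile.is_tarfile and on the two-byte gzip signature.

```python
import io
import os
import tarfile
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def describe_download(path):
    """Hypothetical helper: classify the file at path.

    Returns 'tar', 'gzip-not-tar', or 'not-an-archive'.
    """
    # is_tarfile handles plain and compressed tar archives alike.
    if tarfile.is_tarfile(path):
        return "tar"
    with open(path, "rb") as fh:
        head = fh.read(2)
    return "gzip-not-tar" if head == GZIP_MAGIC else "not-an-archive"


# Illustration only: build one real .tgz and one fake one.
tmp_dir = tempfile.mkdtemp()

good_path = os.path.join(tmp_dir, "sample.tgz")
with tarfile.open(good_path, "w:gz") as archive:
    payload = b"hello"
    info = tarfile.TarInfo(name="sample/hello.txt")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Stand-in for an HTML page that was saved with a .tgz extension.
bad_path = os.path.join(tmp_dir, "page.tgz")
with open(bad_path, "wb") as fh:
    fh.write(b"<html><body>Accept Agreement</body></html>")

print(describe_download(good_path))
print(describe_download(bad_path))
```

A result of 'not-an-archive' for the downloaded file would suggest the URL or the agreement step is the problem rather than the mode passed to tarfile.open.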