
Open a corrupt tar file with Python


I am downloading tar files from an FTP server with Python. However, I am now getting the error "ReadError: unexpected end of data" when I open them, so I assume the files got corrupted during the download. Files fetched with the 'wget' command in the terminal open fine outside Python, but I would like to stick to Python only. This is my code:

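import os
import tarfile
from urllib.request import urlretrieve  # Python 2: from urllib import urlretrieve

os.chdir(aod_ipng)
# Download every tar file listed in ari into the current directory
for x in ari:
    urlretrieve('%s%s' % (url_ipng, x), x)

for i, fileName in enumerate(ari):
    ind = save_ipng[i].index('IVAOT')
    h5f = save_ipng[i][ind:]  # name of the h5.gz member to extract
    # 'r|' reads the archive as an uncompressed, non-seekable stream
    with tarfile.open(fileName, 'r|') as tfile:
        for t in tfile:
            if t.name == h5f:
                tfile.extract(t)  # extract() writes to disk and returns None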
  • ari is a list of strings holding the names of the tar files that need to be downloaded.
  • h5f is the name of the specific h5.gz file that needs to be extracted from the tar file.

Let me know if you need more information regarding my code!

Solution

  • Reliably downloading large files over a bad connection is not easy. If the server supports HTTP range requests, you can resume the download after a broken connection.

    A good start is to use the requests library and read the remote file as a stream. However, you might still have to handle disconnects and resumes yourself.

    See this question for how to use that API

    But please make sure that the file is indeed a tar archive. You can use libmagic for file format detection.

    The .gz file extension suggests a gzip file, not a tar. A gzip file can be read like this:

    import gzip

    # The with block closes the file even if reading fails
    with gzip.open('h5.gz', 'rb') as f:
        file_content = f.read()  # decompressed bytes
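The two suggestions above, resuming a broken download and checking the file type before opening it, can be sketched with the standard library alone. This is only an illustration under some assumptions: it uses urllib.request instead of requests, it checks the gzip magic number and tarfile.is_tarfile instead of libmagic, it only covers HTTP servers that honour the Range header, and the names build_resume_request, resume_download, sniff_archive_type and chunk_size are made up for this example.

```python
import os
import tarfile
import urllib.request


def build_resume_request(url, local_path):
    """Return (request, offset) asking only for the bytes not yet on disk."""
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    req = urllib.request.Request(url)
    if offset:
        # Ask the server to send everything from byte `offset` onwards
        req.add_header("Range", "bytes=%d-" % offset)
    return req, offset


def resume_download(url, local_path, chunk_size=64 * 1024):
    """Download url to local_path, appending to a partial file if possible."""
    req, offset = build_resume_request(url, local_path)
    with urllib.request.urlopen(req) as resp:
        # 206 means the server honoured the Range header; any other status
        # means it sent the whole file, so overwrite the partial copy.
        mode = "ab" if offset and resp.status == 206 else "wb"
        with open(local_path, mode) as out:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)


def sniff_archive_type(path):
    """Return 'gzip', 'tar' or 'unknown' based on the file's content."""
    with open(path, "rb") as fh:
        if fh.read(2) == b"\x1f\x8b":  # gzip magic number
            return "gzip"  # this also covers gzip-compressed tar archives
    if tarfile.is_tarfile(path):
        return "tar"
    return "unknown"
```

A caller would typically wrap resume_download in a retry loop that catches connection errors and calls it again until it returns without raising, then pass the result to sniff_archive_type before choosing between tarfile and gzip. Note that is_tarfile only confirms that the archive can be opened, so a truncated tar may still pass this check and fail later while its members are read.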