Tags: python, compression, gzip, hdf

Python: Read compressed (.gz) HDF file without writing and saving uncompressed file


I have a large number of compressed HDF files, which I need to read.

file1.HDF.gz
file2.HDF.gz
file3.HDF.gz
...

I can read uncompressed HDF files with the following approach:

from pyhdf.SD import SD, SDC
import os

os.system('gunzip < file1.HDF.gz > file1.HDF')
HDF = SD('file1.HDF')

and repeat this for each file. However, this is more time-consuming than I would like.

I suspect that most of the overhead comes from writing the compressed file out as a new uncompressed file, and that I could speed things up if I could read a decompressed version of the file straight into the SD constructor in one step.

Am I correct in this thinking? And if so, is there a way to do what I want?


Solution

  • According to the pyhdf package documentation, this is not possible.

    __init__(self, path, mode=1)
      SD constructor. Initialize an SD interface on an HDF file,
      creating the file if necessary.
    

    There is no alternative constructor for an SD object that accepts a file-like object. This is likely because pyhdf conforms to an external interface (NCSA HDF). HDF files are also often so large that holding an entire file in memory at once is impractical.

    Decompressing to a file on disk is likely your most performant option; see the sketch after the snippet below for a full read-and-clean-up loop.

    If you would like to stay in Python, use the gzip module (docs):

    import gzip
    import shutil
    with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
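
    Since the data still has to hit the disk either way, one practical pattern for a large batch is to decompress each archive to a temporary file, open that with pyhdf, and delete the temporary copy once you are done. Below is a minimal sketch, assuming pyhdf is installed and that the pattern `*.HDF.gz` matches your files; the `print` call is just a placeholder for your own processing:

    import gzip
    import os
    import shutil
    import tempfile
    from glob import glob

    from pyhdf.SD import SD, SDC

    for gz_path in sorted(glob('*.HDF.gz')):
        # Decompress to a named temporary file; delete=False so the file
        # survives the `with` block and SD() can reopen it by path.
        with tempfile.NamedTemporaryFile(suffix='.HDF', delete=False) as tmp:
            with gzip.open(gz_path, 'rb') as f_in:
                shutil.copyfileobj(f_in, tmp)  # streamed, constant memory
            tmp_path = tmp.name
        try:
            hdf = SD(tmp_path, SDC.READ)
            print(gz_path, list(hdf.datasets()))  # placeholder: your processing here
            hdf.end()
        finally:
            os.remove(tmp_path)  # keep at most one uncompressed copy on disk

    Closing the temporary file before calling `SD()` matters because pyhdf reopens the file by path; removing it in the `finally` block keeps the disk footprint at one uncompressed file at a time instead of accumulating copies of every archive.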