
Reading an hdf5 file only after it has completely finished acquiring data


Data is saved into HDF5 files, and saving one file takes roughly 30 seconds. As soon as one file has finished saving, it should be used immediately, until the next file is finished, and so on. Is there a simple way to check whether an HDF5 file has finished being written, so that it is only used afterwards? The files are roughly 10-20 MB each and are all saved in the same folder. I could set a timer of more than 30 seconds, but I want to keep the delay as low as possible, so I need to know exactly when each file has finished acquiring data.

A couple of ideas I have:

  1. Measuring the file size at two points in time. If it has not changed, the file is assumed to be complete.
  2. I don't know much about HDF5 files, but perhaps there is something that appears only at the end of every HDF5 file. If so, I could keep checking whether that last component is present; once it is, the file must be finished.
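Idea 1 could be sketched roughly as follows. This is only an illustration: the function name `wait_for_stable_size` and its parameters (`interval`, `checks`, `timeout`) are hypothetical, and the approach assumes the writer never pauses for longer than `interval * checks` seconds while the file is still incomplete. If it does, the file will wrongly be reported as finished.

```python
import os
import time


def wait_for_stable_size(path, interval=1.0, checks=3, timeout=60.0):
    """Return True once the size of `path` has stayed the same for
    `checks` consecutive comparisons; return False if `timeout` expires."""
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            # File does not exist yet (or is briefly inaccessible)
            size = -1
        if size >= 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            # Size changed (or file missing): restart the count
            stable = 0
        last_size = size
        time.sleep(interval)
    return False
```

A smaller `interval` lowers the delay before the file is picked up, at the cost of a higher risk of mistaking a short pause in writing for completion.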

Any thoughts? I would definitely appreciate any help.

Edit: My idea with the hdf5 part inside on_created:

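import pathlib
import time
from typing import Callable, Union

import h5py
from watchdog.events import DirCreatedEvent, FileCreatedEvent, FileSystemEventHandler


class CustomHandler(FileSystemEventHandler):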

    def __init__(self, callback: Callable):
        # Store callback to be called on every on_created event
        self.callback = callback

    def on_created(self, event: Union[DirCreatedEvent, FileCreatedEvent]):
        #print(f"Event type: {event.event_type}\nAt: {event.src_path}\n")

        # check if it's File creation, not Directory creation
        if isinstance(event, FileCreatedEvent):
            file = pathlib.Path(event.src_path)

            #print(f"Processing file {file.name}\n")

            # call callback
            #self.callback(file)

            wait = 3
            max_wait = 30
            waited = 0

            while True:
                try:
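                    # Open the file itself (not the callback's result) to test
                    # that it is readable, and close it again right away
                    with h5py.File(file, 'r'):
                        pass
                    return self.callback(file)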

                except FileNotFoundError:
                    print('Error: HDF5 File not found')
                    return None

                except OSError:
                    if waited < max_wait:
                        print(f'Error: HDF5 File locked, sleeping {wait} seconds...')
                        time.sleep(wait)
                        waited += wait
                    else:
                        print(f'waited too long= {waited} secs')
                        return None

Solution

  • Based on your comments and our discussion, the easiest implementation might be a function that waits for the file but does not return the h5py file object. That way you can still use the standard context manager (e.g., with h5py.File() as h5f:) and don't need to close the file in the main program.

    I am posting the modified function as a new answer (renamed to h5_wait) to avoid confusion (my first answer has the original function, h5_open_wait). This function is similar, but returns a True/False flag instead of an h5py file object. It checks the file status by calling h5py.File(), then closes the file before returning. The script also reads the HDF5 filename from the command line (sys.argv[1]).

    See new code below:

    import h5py
    import sys
    import time
    
    def h5_wait(h5file):
        
        wait = 3
        max_wait = 30
        waited = 0
    
        while True:
            try:
                h5f = h5py.File(h5file,'r')
                break
                    
            except FileNotFoundError:
                print('\nError: HDF5 File not found\n')
                return False
            
            except OSError:   
                if waited < max_wait:
                    print(f'Warning: HDF5 File locked, sleeping {wait} seconds...')
                    time.sleep(wait) 
                    waited += wait  
                else:
                    print(f'\nWaited too long= {waited} secs, exiting...\n')
                    return False
    
        h5f.close()
        return True
    
    ####################
    
    if len(sys.argv) != 2:
        sys.exit('Include HDF5 file name on command line.')
    h5file = sys.argv[1]         
    
    h5stat = h5_wait(h5file)
    if h5stat is False:
        sys.exit('Error: HDF5 File not available')
        
    with h5py.File(h5file, 'r') as h5f:
        # do something with the file      
        start = time.time()
        for ds, obj in h5f.items():
            print(f'ds name={ds}; shape={obj.shape}')
          
        print(f'\nTime to read {len(list(h5f.keys()))} datasets = {time.time()-start:.2f} secs')
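Since the question asks for the lowest possible delay, one variation is to poll more often and to use a deadline instead of a counter. The sketch below is not part of the answer's code: `wait_until_openable` and its `opener` parameter are hypothetical names, and it assumes, as h5_wait does, that opening the file raises OSError while the writer still holds it. Passing the opener in keeps the helper independent of h5py.

```python
import time


def wait_until_openable(path, opener, wait=0.5, max_wait=30.0):
    """Retry opener(path) until it succeeds.

    Returns True once the file could be opened (and closes it again),
    False if the file is missing or max_wait seconds have passed."""
    deadline = time.monotonic() + max_wait
    while True:
        try:
            handle = opener(path)
        except FileNotFoundError:
            # Must come before OSError, which is its parent class
            return False
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(wait)
        else:
            handle.close()
            return True
```

With h5py this might be called as `wait_until_openable(file, lambda p: h5py.File(p, 'r'))`, for instance from inside on_created before invoking the callback.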