Tags: python, hdf5, scientific-computing, h5py

Should I open/close a file repeatedly or keep it open for an extended period of time (~1 week)?


I'm implementing data collection for a Markov chain Monte Carlo inversion program. However, the MCMC runs can take a week or more to complete! Would it be better to open the file at the beginning of the run:

with h5py.File('my_data.hdf5', 'r+', libver='latest') as fp:
    fp.swmr_mode = True
    mcmc_run(fp)

Or should I open and close it each time I want to add a dataset (inside mcmc_run())?

with h5py.File('my_data.hdf5', 'r+', libver='latest') as fp:
    fp.swmr_mode = True
    fp['dataset'] = new_data

I have to save about 7 MB across 9 datasets for each acceptance (roughly 500 acceptances over about a week of computation time, ~5000 iterations). Unfortunately, the data comes from several different objects inside the iteration, so I can't gather it all together and open the file just once per acceptance.
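
For reference, this is roughly what the per-dataset writes would look like; save_dataset is just a placeholder helper, and in a real run each name would have to be unique per acceptance:

import h5py

def save_dataset(filename, name, data):
    # Placeholder helper each object would call: open the file,
    # write one dataset, and close it again immediately.
    with h5py.File(filename, 'r+', libver='latest') as fp:
        fp[name] = data  # creates a new dataset under this name

With 9 datasets per acceptance and ~500 acceptances, the file would be opened and closed on the order of 4500 times over the week.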


Solution

  • [Posting comment as an answer]

    For runs that take that long, you may want to consider what happens if you have a power outage (as an MC veteran, this is my biggest fear). I recommend closing and re-opening the file: it is probably safer and less likely to leave the file vulnerable to corruption from a power outage, computer crash, etc. while running over many days.
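
    A minimal sketch of what that can look like in practice (the file path, dataset name, and shapes below are just placeholders): create resizable datasets once before the run starts, then open the file only briefly each time there are results to record, so that between writes the file is closed and in a consistent state on disk.

    import h5py
    import numpy as np

    FILENAME = 'my_data.hdf5'  # placeholder path

    def init_file(n_params):
        # Create a resizable dataset once, before the long run starts.
        with h5py.File(FILENAME, 'w', libver='latest') as fp:
            fp.create_dataset('accepted_models', shape=(0, n_params),
                              maxshape=(None, n_params), dtype='f8')

    def record_acceptance(model):
        # Open briefly, append one row, and let the `with` block close
        # the file so it is immediately consistent on disk again.
        with h5py.File(FILENAME, 'r+', libver='latest') as fp:
            ds = fp['accepted_models']
            ds.resize(ds.shape[0] + 1, axis=0)
            ds[-1, :] = model

    With only ~500 acceptances over a week, the overhead of re-opening the file is negligible compared to the protection you get if the machine goes down mid-run.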