Search code examples
hdf5training-datahdf

Split 1 HDF file into 2 HDF files at the ratio of 90:10


I am trying to process data to train a model. I have a dataset processed and saved in a HDF5 file (original HDF file) to separate into two unoverlapping HDF files at the ratio 90:10.

I would like to separate data stored in that HDF file into two other HDF i.e. one HDF for training purpose which contains 90% of dataset in original HDF file and another HDF for validation purpose which contains 10% of dataset in original HDF file. If you have any ideas to do it, please guide me. Thank you so much in advance.


Solution

  • You don't have to separate the data into separate files for training and testing. (In fact, to properly train your model, you would have to do this multiple times -- randomly dividing the data into different training and testing sets each time.)

    One option is to randomize the input when you read the data. You can do this by creating 2 lists of indices (or datasets). One list is the training data, and the other is the test data. Then, iterate over the lists to load the desired data.

    Alternately (and probably simpler), you can use the h5imagegenerator from PyPi. Link to the package description here: pypi.org/project/h5imagegenerator/#description

    If you search SO, you will find more answers on this topic:

    Hope that helps. If you still want to know how to copy data from 1 file to another, take a look at this answer. It shows multiple ways to do that: How can I combine multiple .h5 file? You probably want to use Method 2a. It copies data as-is.