I have a 35 GB CSV file (expected to grow larger in the future) for a binary classification problem in Keras. To train and test my model, I want to split the data into train/test datasets that keep the same proportion of positive samples in each. Something like this:
| Dataset type | Total samples | Negative samples | Positive samples |
|--------------|---------------|------------------|------------------|
| Dataset      | 10000         | 8000             | 2000             |
| Train        | 7000          | 5600             | 1400             |
| Test         | 3000          | 2400             | 600              |
As this dataset is too large to fit into memory, I created a custom generator that loads the data in batches and trains the model via `fit_generator`. Therefore, I cannot apply scikit-learn's `StratifiedShuffleSplit`, since it needs the entire dataset, not just a portion of it, to keep the proportion of positive instances the same in both the train and test datasets.
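A minimal sketch of that kind of batch generator, assuming the binary label lives in a `label` column of the CSV (the column name, file name, and batch size below are placeholders, not part of my actual setup):

```python
import pandas as pd

def csv_batch_generator(csv_path, batch_size=128):
    """Yield (features, labels) batches read lazily from a large CSV."""
    while True:  # Keras expects the generator to loop indefinitely
        for chunk in pd.read_csv(csv_path, chunksize=batch_size):
            X = chunk.drop(columns=['label']).to_numpy()
            y = chunk['label'].to_numpy()
            yield X, y

# model.fit_generator(csv_batch_generator('data.csv'), steps_per_epoch=n_rows // 128, epochs=10)
```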
Edit: My data has the following shape: 11500 x 160000
Does anyone know how I could do this?
I followed Ian Lin's answer step by step. Just note that if you have a large number of columns, converting a DataFrame into HDF5 may fail. In that case, create the HDF5 file directly from a NumPy array.
Also, to append data to an HDF5 file I had to do the following (set the `maxshape` entry to `None` for every dimension of the dataset that you want to resize without limit; in my case, I resize the dataset to append an unlimited number of rows while keeping a fixed number of columns):
import os
import h5py
import numpy as np

path = 'test.h5'
mydata = np.random.rand(11500, 160000)

if not os.path.exists(path):
    # First write: make axis 0 resizable so more rows can be appended later
    with h5py.File(path, 'w') as hf:
        hf.create_dataset('dataset', data=mydata, maxshape=(None, mydata.shape[1]))
else:
    # Subsequent writes: grow the dataset along axis 0 and write the new rows at the end
    with h5py.File(path, 'a') as hf:
        hf['dataset'].resize(hf['dataset'].shape[0] + mydata.shape[0], axis=0)
        hf['dataset'][-mydata.shape[0]:, :] = mydata
I usually do this:

1. Preprocess the data and save it in a format that supports row-wise indexing, e.g. with `pandas.DataFrame.to_hdf()` (or pytables).
2. Apply the stratified split to the row indices, `range(dataset.shape[0])`, rather than to the data itself (see the sketch below).
3. Build a generator that loads only the rows belonging to the train or test indices.
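A short sketch of step 2, assuming the per-row labels fit in memory (one value per row, so this is cheap even when the features do not); the `labels` array below is a placeholder:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

labels = np.random.randint(0, 2, size=11500)  # placeholder: one binary label per row

# Split the row indices, not the data itself, so nothing large is loaded
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(sss.split(np.zeros(len(labels)), labels))
# train_idx / test_idx preserve the positive/negative proportion of `labels`
```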
If you are using `keras.image.ImageDataGenerator.flow()` as the generator, you can refer to a helper I wrote here to reindex the data more easily.