Tags: python, split, scikit-learn, large-data

Stratified Shuffle Split for large files


I have a 35 GB CSV file (expected to grow larger in the future) for a binary classification problem in Keras. To train and test my model, I want to split the data into train/test datasets that have the same proportion of positive samples in each. Something like this:

|Dataset type | Total samples | Negative samples | Positive samples |
|-------------|---------------|------------------|-------------------|
|Dataset      |    10000      |        8000      |       2000        |
|Train        |    7000       |        5600      |       1400        |
|Test         |    3000       |        2400      |        600        |

As this dataset is too large to fit into memory, I created a custom generator to load the data in batches and train the model via fit_generator. Therefore, I cannot apply scikit-learn's StratifiedShuffleSplit directly, because it needs the entire dataset, not just a portion of the data, to keep the proportion of positive instances the same in both the train and test datasets.
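For illustration, a generator of this kind can be sketched like this (a rough sketch only: the file name 'data.csv' and the 'label' column name are placeholders):

import numpy as np
import pandas as pd

def csv_batch_generator(path, batch_size=32):
    # Stream the CSV in chunks so the full 35 GB file never sits in memory
    while True:  # Keras expects the generator to loop forever
        for chunk in pd.read_csv(path, chunksize=batch_size):
            y = chunk.pop('label').to_numpy()        # 'label' column is a placeholder
            X = chunk.to_numpy(dtype=np.float32)
            yield X, y

# model.fit_generator(csv_batch_generator('data.csv'), steps_per_epoch=..., epochs=...)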

Edit: My data has the following shape: 11500 x 160000

Does anyone know how I could do this?

Solution

I followed Ian Lin's answer step by step. Just note that if you have a large number of columns, converting a DataFrame to HDF5 may fail, so create the HDF5 file directly from a NumPy array.

Also, to append data to an HDF5 file I had to do the following (set maxshape to None for every dimension of your dataset that you want to be able to resize without limit; in my case, I resize the dataset so I can append an unlimited number of rows while keeping a fixed number of columns):

import os
import h5py
import numpy as np

path = 'test.h5'
mydata = np.random.rand(11500, 160000)
if not os.path.exists(path):
    # First write: create a resizable dataset (rows unlimited, columns fixed)
    with h5py.File(path, 'w') as hf:
        hf.create_dataset('dataset', data=mydata, maxshape=(None, mydata.shape[1]))
else:
    # Later writes: grow the dataset along axis 0 and append the new rows
    with h5py.File(path, 'a') as hf:
        hf['dataset'].resize(hf['dataset'].shape[0] + mydata.shape[0], axis=0)
        hf['dataset'][-mydata.shape[0]:, :] = mydata

Solution

  • I usually do this:

    1. store the data in a file-backed format such as numpy.memmap or an HDF5 dataset (if your dataset has a large number of features, use h5py rather than pandas.DataFrame.to_hdf() or PyTables)
    2. generate an integer index over the rows, e.g. range(dataset.shape[0])
    3. use a split utility from scikit-learn (e.g. StratifiedShuffleSplit) to split the integer index into train/test indices
    4. pass the integer indices to your generator and use them to look up the corresponding rows in the h5py.Dataset or numpy.memmap (a sketch combining these steps appears after the note below)

    If you are using keras.preprocessing.image.ImageDataGenerator.flow() as the generator, you can refer to a helper I wrote here to make reindexing the data easier.
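    Putting steps 2-4 together, a minimal sketch (my own illustration, not the original helper: it assumes the labels fit in memory in a separate 'labels.npy' file and uses StratifiedShuffleSplit to preserve the class proportions) could look like this:

    import h5py
    import numpy as np
    from sklearn.model_selection import StratifiedShuffleSplit

    # 'labels.npy' is a placeholder: one label per row of the HDF5 dataset
    y = np.load('labels.npy')
    indices = np.arange(y.shape[0])          # step 2: integer index over the rows

    # step 3: stratified split of the integer index
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
    train_idx, test_idx = next(splitter.split(indices.reshape(-1, 1), y))

    # step 4: the generator looks rows up in the HDF5 dataset by index
    def hdf5_batch_generator(h5_path, idx, y, batch_size=32):
        with h5py.File(h5_path, 'r') as hf:
            dset = hf['dataset']
            while True:                      # Keras expects the generator to loop forever
                np.random.shuffle(idx)
                for start in range(0, len(idx), batch_size):
                    batch = np.sort(idx[start:start + batch_size])  # h5py needs sorted indices
                    yield dset[batch, :], y[batch]

    # train_gen = hdf5_batch_generator('test.h5', train_idx, y)
    # test_gen = hdf5_batch_generator('test.h5', test_idx, y)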