Tags: tensorflow, pytorch, dataset, numpy-memmap, batchsize

Memmap arrays to PyTorch and gradient accumulation


I have a large dataset (> 62 GiB) that, after processing, is saved as two NumPy memmap arrays, one for the data and one for the labels. The arrays have shapes (7390, 60, 224, 224, 3) and (7390,), and the dataset is NOT shuffled, so I need to shuffle it first.
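
For reference, the memmap arrays are opened along these lines (the file names below are just placeholders; the shapes and dtypes match what I use later):

import numpy as np

# Placeholder file names; only the shapes and dtypes reflect the real arrays
numpy_array = np.memmap("data.dat", dtype=np.uint8, mode="r",
                        shape=(7390, 60, 224, 224, 3))
labels = np.memmap("labels.dat", dtype=np.int32, mode="r", shape=(7390,))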

Currently I use TensorFlow 2, and this is the generator code I have been using to manage the memmap arrays:

import numpy as np
import tensorflow as tf

def my_generator():
    # Yield one sample (a clip of 60 frames) and its label at a time
    for i in range(len(numpy_array)):
        yield numpy_array[i, :, :, :, :], np.array(labels[i]).reshape(1)

full_dataset = tf.data.Dataset.from_generator(
    generator=my_generator,
    output_types=(np.uint8, np.int32),
    output_shapes=((60, 224, 224, 3), (1,))
)

full_dataset = full_dataset.shuffle(SHUFFLE_BUFFER_SIZE, reshuffle_each_iteration=False)
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.skip(test_size)
test_dataset = test_dataset.take(test_size) 

That way I can train with shuffling and batching without loading the entire dataset into memory.

Now, with the current model and dataset, the VRAM is not enough to load more than 2 samples as tensors, and I can't train with a batch size of 2.

I thought of gradient accumulation, but I couldn't get it working with TF2. It looks easy in PyTorch, but I can't find how to handle the memmap arrays with shuffling and splitting the way I do in TensorFlow with generators.

So I need to know how to load the dataset in PyTorch with the same shuffling and batching as above.

Or, if someone has ready-made code for gradient accumulation in TF2, that would help too.
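
To be clear, what I am trying to achieve with gradient accumulation is roughly this: run several small batches, sum their scaled gradients, and apply a single optimizer step. A rough sketch of the idea (not code I have verified), assuming a model, loss_fn, and optimizer already exist:

import tensorflow as tf

ACCUM_STEPS = 8  # number of small batches to accumulate before one update

accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(train_dataset.batch(2)):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        # Scale the loss so the accumulated gradient matches a large-batch average
        loss = loss_fn(y, preds) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]

    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]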


Solution

  • I will just address the shuffle question.

    Instead of shuffling with tf.data.Dataset, do it at the generator level. This should work:

    import numpy as np

    class Generator(object):
        def __init__(self, images, labels, batch_size):
            self.images = images
            self.labels = labels
            self.batch_size = batch_size
            self.idxs = np.arange(len(self.images))
            self.on_epoch_end()

        def on_epoch_end(self):
            # Reshuffle the sample indices
            np.random.shuffle(self.idxs)

        def generator(self):
            # Yield one (image, label) pair per shuffled index
            i = 0
            while i < len(self.idxs):
                idx = self.idxs[i]
                yield (self.images[idx], self.labels[idx])
                i += 1
            self.on_epoch_end()

        def batch_generator(self):
            it = iter(self.generator())
            while True:
                vals = []
                for _ in range(self.batch_size):
                    try:
                        vals.append(next(it))
                    except StopIteration:
                        # Epoch finished: restart over the freshly reshuffled indices
                        it = iter(self.generator())
                        vals.append(next(it))
                images, labels = zip(*vals)
                yield images, labels
    

    Then you can use it like this:

    gen = Generator(...)
    it = gen.batch_generator()
    
    batch = next(it)  # Call this every time you want a new batch
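
    If you still want to go through tf.data as in the question, one possible way to wrap the batch generator (a sketch; the dtypes and sample shapes come from the question, the rest is an assumption) is:

    dataset = tf.data.Dataset.from_generator(
        generator=gen.batch_generator,
        output_types=(np.uint8, np.int32),
        output_shapes=((None, 60, 224, 224, 3), (None,))
    )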
    

    I'm sure PyTorch has built-in methods for this kind of stuff, though.
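
    For the PyTorch side, a rough sketch (the file names, the split sizes, and the training objects below are assumptions; the shapes and dtypes come from the question) would be to wrap the memmap arrays in a torch.utils.data.Dataset and let DataLoader handle shuffling and batching, with a plain gradient-accumulation loop on top:

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader, random_split

    class MemmapDataset(Dataset):
        def __init__(self, images, labels):
            self.images = images  # np.memmap, shape (7390, 60, 224, 224, 3)
            self.labels = labels  # np.memmap, shape (7390,)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            # Copy only the requested sample out of the memmap
            x = torch.from_numpy(np.array(self.images[idx]))
            y = torch.tensor(int(self.labels[idx]))
            return x, y

    # Hypothetical file names; replace with the real memmap paths
    images = np.memmap("data.dat", dtype=np.uint8, mode="r",
                       shape=(7390, 60, 224, 224, 3))
    labels = np.memmap("labels.dat", dtype=np.int32, mode="r", shape=(7390,))

    full_dataset = MemmapDataset(images, labels)
    train_size = int(0.8 * len(full_dataset))          # assumed 80/10/10 split
    test_size = (len(full_dataset) - train_size) // 2
    val_size = len(full_dataset) - train_size - test_size
    train_ds, test_ds, val_ds = random_split(full_dataset,
                                             [train_size, test_size, val_size])

    # shuffle=True reshuffles the indices every epoch; the arrays stay on disk
    train_loader = DataLoader(train_ds, batch_size=2, shuffle=True)

    # Gradient accumulation (assumes `model`, `criterion`, `optimizer` already exist)
    ACCUM_STEPS = 8
    optimizer.zero_grad()
    for step, (x, y) in enumerate(train_loader):
        loss = criterion(model(x.float()), y) / ACCUM_STEPS
        loss.backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()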