Tags: tensorflow, pytorch, dataset, numpy-memmap, batchsize

Memmap arrays to PyTorch and gradient accumulation


I have a large dataset (> 62 GiB) that, after processing, is saved as two NumPy memmap arrays, one for the data and one for the labels. The arrays have shapes (7390, 60, 224, 224, 3) and (7390,), and the dataset is NOT shuffled, so I need to shuffle it first.
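
For reference, the memmap arrays are opened along these lines (the file names below are just placeholders; the shapes and dtypes match what I use later):

import numpy as np

# Placeholder file names; only the shapes and dtypes reflect the real arrays
numpy_array = np.memmap("data.dat", dtype=np.uint8, mode="r",
                        shape=(7390, 60, 224, 224, 3))
labels = np.memmap("labels.dat", dtype=np.int32, mode="r", shape=(7390,))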

Currently I use TensorFlow 2, and this is the generator code I have been using to manage the memmap arrays:

import numpy as np
import tensorflow as tf

def my_generator():
    # Yield one sample (a clip of 60 frames) and its label at a time
    for i in range(len(numpy_array)):
        yield numpy_array[i, :, :, :, :], np.array(labels[i]).reshape(1)

full_dataset = tf.data.Dataset.from_generator(
    generator=my_generator,
    output_types=(np.uint8, np.int32),
    output_shapes=((60, 224, 224, 3), (1,))
)

full_dataset = full_dataset.shuffle(SHUFFLE_BUFFER_SIZE, reshuffle_each_iteration=False)
train_dataset = full_dataset.take(train_size)
test_dataset = full_dataset.skip(train_size)
val_dataset = test_dataset.skip(test_size)
test_dataset = test_dataset.take(test_size) 

That way I can train with shuffling and batching without loading the entire dataset into memory.

Now, with the current model and dataset, the VRAM is not enough to load more than 2 samples as tensors, and I can't train with a batch size of 2.

I thought of gradient accumulation, but I couldn't get it working with TF2. It looks easy in PyTorch, but I can't find how to handle the memmap arrays with shuffling and splitting the way I do in TensorFlow with generators.

So I need to know how to load the dataset in PyTorch with the same shuffling and batching as above.

Or, if someone has ready-made code for gradient accumulation in TF2, that would help too.
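
To be clear, what I am trying to achieve with gradient accumulation is roughly this: run several small batches, sum their scaled gradients, and apply a single optimizer step. A rough sketch of the idea (not code I have verified), assuming a model, loss_fn, and optimizer already exist:

import tensorflow as tf

ACCUM_STEPS = 8  # number of small batches to accumulate before one update

accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(train_dataset.batch(2)):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        # Scale the loss so the accumulated gradient matches a large-batch average
        loss = loss_fn(y, preds) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]

    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]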


Solution

  • I will just address the shuffle question.

    Instead of shuffling with tf.data.Dataset, do it at the generator level. This should work:

    import numpy as np

    class Generator(object):
        def __init__(self, images, labels, batch_size):
            self.images = images
            self.labels = labels
            self.batch_size = batch_size
            self.idxs = np.arange(len(self.images))
            self.on_epoch_end()

        def on_epoch_end(self):
            # Reshuffle the sample indices
            np.random.shuffle(self.idxs)

        def generator(self):
            # Yield one (image, label) pair per shuffled index
            i = 0
            while i < len(self.idxs):
                idx = self.idxs[i]
                yield (self.images[idx], self.labels[idx])
                i += 1
            self.on_epoch_end()

        def batch_generator(self):
            it = iter(self.generator())
            while True:
                vals = []
                for _ in range(self.batch_size):
                    try:
                        vals.append(next(it))
                    except StopIteration:
                        # Epoch finished: restart over the freshly reshuffled indices
                        it = iter(self.generator())
                        vals.append(next(it))
                images, labels = zip(*vals)
                yield images, labels
    

    Then you can use it like this:

    gen = Generator(...)
    it = gen.batch_generator()
    
    batch = next(it)  # Call this every time you want a new batch
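
    If you still want to go through tf.data as in the question, one possible way to wrap the batch generator (a sketch; the dtypes and sample shapes come from the question, the rest is an assumption) is:

    dataset = tf.data.Dataset.from_generator(
        generator=gen.batch_generator,
        output_types=(np.uint8, np.int32),
        output_shapes=((None, 60, 224, 224, 3), (None,))
    )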
    

    I'm sure PyTorch has built-in methods for this kind of stuff, though.
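
    For the PyTorch side, a rough sketch (the file names, the split sizes, and the training objects below are assumptions; the shapes and dtypes come from the question) would be to wrap the memmap arrays in a torch.utils.data.Dataset and let DataLoader handle shuffling and batching, with a plain gradient-accumulation loop on top:

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader, random_split

    class MemmapDataset(Dataset):
        def __init__(self, images, labels):
            self.images = images  # np.memmap, shape (7390, 60, 224, 224, 3)
            self.labels = labels  # np.memmap, shape (7390,)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            # Copy only the requested sample out of the memmap
            x = torch.from_numpy(np.array(self.images[idx]))
            y = torch.tensor(int(self.labels[idx]))
            return x, y

    # Hypothetical file names; replace with the real memmap paths
    images = np.memmap("data.dat", dtype=np.uint8, mode="r",
                       shape=(7390, 60, 224, 224, 3))
    labels = np.memmap("labels.dat", dtype=np.int32, mode="r", shape=(7390,))

    full_dataset = MemmapDataset(images, labels)
    train_size = int(0.8 * len(full_dataset))          # assumed 80/10/10 split
    test_size = (len(full_dataset) - train_size) // 2
    val_size = len(full_dataset) - train_size - test_size
    train_ds, test_ds, val_ds = random_split(full_dataset,
                                             [train_size, test_size, val_size])

    # shuffle=True reshuffles the indices every epoch; the arrays stay on disk
    train_loader = DataLoader(train_ds, batch_size=2, shuffle=True)

    # Gradient accumulation (assumes `model`, `criterion`, `optimizer` already exist)
    ACCUM_STEPS = 8
    optimizer.zero_grad()
    for step, (x, y) in enumerate(train_loader):
        loss = criterion(model(x.float()), y) / ACCUM_STEPS
        loss.backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()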