I'm working with a Keras model on a binary-classification problem. The training dataset has grown so much that I can no longer fit it into memory, so I rewrote my dataset-generation code to produce many NumPy arrays and save them to disk, essentially splitting the dataset into processable chunks. What I'm seeing now is that the model seems to forget the data it was trained on earlier (it keeps tweaking its weights to fit the newest chunk), so out of 26k samples (each file holds 500 samples) the model effectively performs based on the last 500 alone. This is the part of the code I'm using to train:
    for fname in input_file_names:
        np_file = np.load(f"{TRAINING_FOLDER}/{fname}", mmap_mode='r')
        X = np_file['array1']
        y = np_file['array2']
        length_to_use = X.shape[0]
        reached_training_targets += X.shape[0]
        # Cap the total number of samples used at NUM_SAMPLES
        if reached_training_targets > NUM_SAMPLES:
            length_to_use -= (reached_training_targets - NUM_SAMPLES)
            if length_to_use <= 0:
                break
        print(f"Training batch on {length_to_use} samples from file {fname}...")
        X = X[:length_to_use]
        y = y[:length_to_use]
        # Shuffle the samples within this file
        rand_idx = np.random.permutation(X.shape[0])
        X = X[rand_idx]
        y = y[rand_idx]
        model.fit(X, y, epochs=EPOCHS, batch_size=32, verbose=2, callbacks=[early_stopping, lr_schedule])
        np_file.close()
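For reference, each chunk file is produced roughly like this (a minimal sketch; the np.savez call, the key names 'array1'/'array2', and the save_chunk helper are assumptions inferred from the loading code, with 500 samples per file as described above):

    import numpy as np

    CHUNK_SIZE = 500  # samples per file, per the description above

    def save_chunk(chunk_idx, X_chunk, y_chunk):
        # X_chunk: (CHUNK_SIZE, ...features...), y_chunk: (CHUNK_SIZE,)
        np.savez(f"{TRAINING_FOLDER}/chunk_{chunk_idx}.npz",
                 array1=X_chunk, array2=y_chunk)
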
I have already tried reducing the initial learning rate and switching to a different optimizer. This is a new problem for me: until a few days ago the whole dataset fit in memory, but now I need to deal with this and I don't know what the best technique is.
I solved it:
    # Open every chunk file once and keep the handles around
    input_file_handles = [None] * len(input_file_names)
    for i, file_name in enumerate(input_file_names):
        input_file_handles[i] = np.load(f"{TRAINING_FOLDER}/{file_name}", mmap_mode='r')

    def data_generator():
        global batch_size
        for i in range(NUM_SAMPLES):
            batch_X = np.empty((batch_size, 1000, 63), dtype=np.double)
            batch_y = np.empty((batch_size,), dtype=np.uint8)
            # Take the i-th sample from every open file, so each batch
            # mixes samples from all files
            for j, file_handle in enumerate(input_file_handles):
                batch_X[j] = file_handle['array1'][i]
                batch_y[j] = file_handle['array2'][i]
            yield batch_X, batch_y
I implemented a generator and trained with fit_generator, setting the batch size to the number of files needed to cover the NUM_SAMPLES requested. That way every batch contains one sample from each file, so the model never sees a long run of batches drawn from a single file, which avoids the catastrophic forgetting.
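For completeness, a minimal sketch of how the generator can be wired into training (the exact call isn't shown above; steps_per_epoch and the endless_generator wrapper are assumptions, the latter because fit_generator expects the generator to keep yielding across epochs):

    batch_size = len(input_file_handles)  # one sample per file in every batch

    def endless_generator():
        # Re-run the finite data_generator() indefinitely so it can
        # serve more than one epoch (assumption, not part of the answer)
        while True:
            yield from data_generator()

    model.fit_generator(
        endless_generator(),
        steps_per_epoch=NUM_SAMPLES,  # data_generator yields NUM_SAMPLES batches per pass
        epochs=EPOCHS,
        verbose=2,
        callbacks=[early_stopping, lr_schedule],
    )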