Tags: python, tensorflow, keras, gpu, cpu-usage

Understanding the Hardware usage when training a Classifier on a GPU


I am training a Cat-Dog Classifier with the help of transfer learning on a GPU using TF 2.0. I used the Keras ImageDataGenerator to perform data augmentation. While training the model, I monitored the usage of the GPU, Disk (HDD) and CPU, and noted the following:-

  1. For the first epoch, the Disk usage is at a constant 20-25%. However, near the end of the epoch, during the validation phase, it spikes to 40-50%. For the subsequent epochs, the Disk usage is strangely zero.
  2. The CPU utilization is at ~60% throughout an epoch, except at the end when it spikes to 100%.
  3. When I loaded any of the Keras models (VGG16, Xception, InceptionResNetV2, etc.), each of them took about 6.5 GB of VRAM.
  4. The GPU usage is at a constant 29-30% for almost the entirety of an epoch, except at the end when it spikes to 70%.

From these observations, I made the following inferences:-

  1. During the first epoch, the images are loaded from the HDD into RAM in batches of 8 (batch_size=8); as soon as a batch is loaded, the CPU performs data augmentation on it and sends the augmented batch to the GPU for training. When the validation stage comes, there is no data augmentation to perform on the validation data, so the CPU can pass the batches of images from RAM to the GPU without any preprocessing. The overhead is therefore reduced, and all three (Disk, CPU and GPU) work at a higher speed, hence the higher usage in all of them at the end of an epoch.
  2. For the subsequent epochs, since all the images (training + validation) have already been loaded into RAM, there is no need to fetch them from the Disk, so the Disk usage remains zero.

However, there were a few things I could not wrap my head around:-

  1. On the Keras Applications page, VGG19 has the largest memory footprint (549 MB), and I can't understand how it can go on to consume about 6.5 GB of VRAM.
  2. Why did all the models take the same amount of VRAM when loaded, even though they differ vastly in size (both in total number of layers and in memory footprint)?

Here are a few code snippets:-

# Imports assumed for these snippets (preprocess_input taken from the InceptionResNetV2 module)
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint

train_datagen = ImageDataGenerator(rotation_range = 30,
                                   width_shift_range = 0.4,
                                   height_shift_range = 0.4,
                                   shear_range = 0.4,
                                   zoom_range = 0.25,
                                   horizontal_flip = True,
                                   brightness_range = [0.5, 1.5],
                                   preprocessing_function = preprocess_input) 

valid_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)

train_generator = train_datagen.flow_from_dataframe(train_data,
                                                    directory = 'train/',
                                                    x_col = 'Photo',
                                                    y_col = 'Class',
                                                    target_size = (299,299),
                                                    class_mode = 'binary',
                                                    seed = 42,
                                                    batch_size = 8)

validation_generator = valid_datagen.flow_from_dataframe(valid_data,
                                                    directory = 'train/',
                                                    x_col = 'Photo',
                                                    y_col = 'Class',
                                                    target_size = (299,299),
                                                    class_mode = 'binary',
                                                    seed = 42,
                                                    batch_size = 8)

inception_resnet_v2 = InceptionResNetV2(include_top = False,
                                    weights = 'imagenet',
                                    input_shape = (299, 299, 3),
                                    pooling = 'avg',
                                    classes = 2)
inception_resnet_v2.trainable = False

out = Dense(1, activation = 'sigmoid')(inception_resnet_v2.output)

model = Model(inputs = inception_resnet_v2.inputs, outputs = out)

checkpoint = ModelCheckpoint('model.h5',
                             monitor = 'val_accuracy',
                             verbose = 0,
                             save_best_only = True,
                             save_weights_only = False, 
                             mode = 'max',
                             period = 1)

optim = tf.keras.optimizers.Adam(lr = 0.0001)
model.compile(optimizer = optim, loss = 'binary_crossentropy', metrics = ['accuracy'])

hist = model.fit_generator(train_generator,
                           steps_per_epoch = len(train_generator),
                           epochs = 10,
                           callbacks = [checkpoint],
                           validation_data = validation_generator,
                           verbose = 1,
                           validation_steps = len(validation_generator),
                           validation_freq = 1)

I would be grateful if someone could answer my questions and also point out whether my inferences were correct or wrong.

Thanks.


Solution

  • By default, TensorFlow maps nearly all of the GPU memory of all GPUs visible to the process (subject to CUDA_VISIBLE_DEVICES). This is why every model you loaded appeared to occupy about 6.5 GB of VRAM, regardless of its actual size.

    You can limit this with the following (TF 1.x-style session configuration; under TF 2.0 it is available through tf.compat.v1):

    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()                         # TensorFlow session configuration
    config.gpu_options.allow_growth = True                      # Don't pre-allocate memory; allocate as needed
    config.gpu_options.per_process_gpu_memory_fraction = 0.95   # Allow at most 95% of the GPU memory to be allocated
    tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))  # Create a session with the above options
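
    Since the question uses TF 2.0, the same effect can also be achieved without the compat layer. Below is a minimal sketch using the tf.config.experimental API; it must run at program start, before the model is built, and switches allocation to on-demand growth:

    import tensorflow as tf

    # Must run before any GPU work (e.g. before the model is built)
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        # Allocate GPU memory on demand instead of mapping nearly all of it up front
        tf.config.experimental.set_memory_growth(gpu, True)

    Alternatively, tf.config.experimental.set_virtual_device_configuration can put a hard cap (in MB) on how much memory TensorFlow may allocate on a device; both options are described in the GPU guide linked below.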
    

    You can also check https://www.tensorflow.org/guide/gpu for more information.
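
  • As a side note on the 549 MB vs ~6.5 GB gap: the size listed on the Keras Applications page is roughly that of the stored float32 weights, which is far smaller than what TensorFlow reserves by default. A rough sketch of that estimate (assuming 4 bytes per parameter and ignoring activations, gradients and optimizer state):

    # Rough estimate of the weight memory of a Keras application model (float32 assumed)
    from tensorflow.keras.applications import VGG19

    model = VGG19(weights=None)   # weights=None: build with random weights, no download needed
    params = model.count_params()
    print(f'{params:,} parameters ~ {params * 4 / 1024**2:.0f} MB of float32 weights')

    For VGG19 this comes out around the 549 MB listed on the Applications page, so the ~6.5 GB reading reflects TensorFlow's default memory mapping rather than the models themselves.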