Given a Keras model (on Colab) that has input shape (None,256,256,3) and batch_size is 16 then the memory allocated for that input shape is 16*256*256*3*datatype (datatype=2,4,8 depending on float16/32/64). This is how it works. My confusion is that for a given batch_size (=16) 1*256*256*3 could have been allocated and the 16 images could have been passed one by one and the final gradient could have been averaged.
1) So, is the allocation dependent on batch size so that 'batch_size' computations can be done in parallel and the configuration that I have mentioned above (1*256*256*3) would be serializing and hence defeating the purpose of GPU?
2) Would the same type of allocation happen on CPU for parallel computation (if the answer to 1) is yes)?
In general batch size is what you need to tune-up.
And as for your query batch size is data-dependent, and as you use batches, you are generally running a generator object, which loads data in batches, perform GD and then move on next.
It is preferred to use batch gradient decent as it converges faster than GD
Also as you increase batch size, so more no of training no of examples will be loaded, increasing memory allocation,
Yes you can use parallel computation for training large batches but overall you are doing same, as you are actually calculating whole batches each time which you are doing in genral batch computation
CPU should have cores, Then Yes, Else You Need GPU as Computing Requires A lOt of powers Because all you are doing under the hood is working with n dimensional matrices, calculating partial derivatives and then calculating square loss and further updating weights values