A newbie here: I've been training a simple Dogs vs Cats model on a potato PC with no GPU, so I have to pause and resume training sometimes. Yesterday I realized I'd get better performance if I decreased the batch size, so I changed it from 128 to 64 and then doubled the epoch count from 25 to 50 (is this the right thing to do?). I use a callback that saves to save_at_{epoch}.keras to record progress, then resume by loading the saved model and setting initial_epoch to match. Now let's say I left off at epoch 8/25, so I have the save_at_8.keras file. Now that I've changed the batch size to 64, should I set initial_epoch to 16 or to 8?
An epoch is a single pass through your dataset. A step is a single batch of data from your dataset. So a typical training loop looks like:
for epoch in range(num_epochs):
    for step in range(len(dataset) // batch_size):
        # one step = one batch of examples
        batch = dataset[step * batch_size : (step + 1) * batch_size]
        update_weights(batch)
    save_checkpoint(epoch)
So changing batch_size changes how many steps are performed per epoch, but it doesn't change how many epochs there are. The epoch count is either fixed or has an upper limit, and it's independent of batch size.
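To make that concrete, here's the step-per-epoch math before and after your change, assuming the usual 25,000-image Dogs vs Cats training set (swap in your actual dataset size):

```python
# Steps per epoch = dataset size // batch size.
# 25,000 images is an assumption (the standard Kaggle Dogs vs Cats split).
dataset_size = 25_000

steps_at_128 = dataset_size // 128  # steps per epoch at your old batch size
steps_at_64 = dataset_size // 64    # steps per epoch at your new batch size

print(steps_at_128, steps_at_64)  # 195 390
# Halving the batch size already doubles the steps per epoch,
# so each epoch still sees the same 25,000 images either way.
```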
So resume from save_at_8.keras with initial_epoch=8, not 16: an epoch means the same thing regardless of batch size. Whether you also adjust the total number of epochs is up to you, it really doesn't matter all that much.
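Here's a toy sketch of the checkpoint/resume bookkeeping, using JSON files as a stand-in for your .keras saves. In actual Keras this corresponds to keras.models.load_model("save_at_8.keras") followed by model.fit(..., initial_epoch=8, epochs=50); the point is that epoch numbers are never rescaled when the batch size changes:

```python
import json
import os
import tempfile

def train(ckpt_dir, total_epochs, initial_epoch=0):
    """Stand-in for model.fit(epochs=total_epochs, initial_epoch=initial_epoch)."""
    epochs_run = []
    for epoch in range(initial_epoch, total_epochs):
        epochs_run.append(epoch)  # stand-in for one epoch of training
        # stand-in for the save_at_{epoch}.keras checkpoint callback
        with open(os.path.join(ckpt_dir, f"save_at_{epoch + 1}.json"), "w") as f:
            json.dump({"epoch": epoch + 1}, f)
    return epochs_run

with tempfile.TemporaryDirectory() as d:
    train(d, total_epochs=8)  # first run, stopped at epoch 8
    # resume after the batch-size change: same initial_epoch=8, new total of 50
    resumed = train(d, total_epochs=50, initial_epoch=8)
    print(resumed[0], resumed[-1], len(resumed))  # 8 49 42
```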
The only caveat: some trainers have "warmup" or "learning rate schedulers" that are based on the number of steps performed, so restarting at epoch > 0 without adjusting their parameters may cause issues.
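To illustrate why step-based schedules care, here's a minimal linear-warmup sketch (the base_lr and warmup_steps values are made up for illustration). If resuming resets the step counter to 0, the schedule re-runs warmup even though training is well past it; and because batch size changes the steps per epoch, "warmup over N steps" means a different fraction of an epoch at batch 64 than at batch 128:

```python
def warmup_lr(global_step, base_lr=1e-3, warmup_steps=1000):
    # Linear warmup: ramp from 0 to base_lr over warmup_steps, then hold flat.
    return base_lr * min(1.0, global_step / warmup_steps)

# Illustrative numbers: 390 steps/epoch at batch size 64 (assumed 25k dataset),
# pretending that rate held for all 8 completed epochs.
steps_per_epoch = 25_000 // 64
global_step = 8 * steps_per_epoch

print(warmup_lr(global_step))  # well past warmup: full base_lr, 0.001
print(warmup_lr(0))            # a reset step counter would restart warmup at 0.0
```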