Meaning of validation_steps in Keras Sequential fit_generator parameter list


I am using Keras with a TensorFlow backend in Python; more precisely, TensorFlow 1.2.1 and its built-in contrib.keras library.

I want to use the fit_generator method of a Sequential model object, but I am confused about what I should pass as the method's parameters.

From reading the doc here I got the following information:

  • generator : a python training data batch generator; endlessly looping over its training data
  • validation_data : in my case, a python validation data batch generator; the doc doesn't mention endless looping over its validation data
  • steps_per_epoch : number of training batches = uniqueTrainingData / batchSize
  • validation_steps : ??? ; = uniqueValidationData / batchSize ???
  • use_multiprocessing : boolean; don't pass non-picklable arguments ???
  • workers : max number of used processes

As indicated above with ???, I don't really know what validation_steps means. I know the definition from the doc linked above (Number of steps to yield from validation generator at the end of every epoch), but that only confuses me in the given context. From the doc I know that the validation_data generator has to yield data/label tuples in the form (inputs, targets). In contrast, the statement above indicates that there have to be multiple "steps to yield from validation generator at the end of every epoch", which in this context would mean that multiple validation batches are yielded after each training epoch.

Questions about validation_steps:

  • Does it really work that way? If so: Why? I thought that after each epoch one validation batch, ideally one that wasn't used before, is used for validation, so that the training gets validated without the risk of "training" the model to perform better on already used validation sets.
  • In the context of the previous question: Why is the recommended number of validation steps uniqueValidationData / batchSize and not uniqueValidationData / epochs? Isn't it better to have e.g. 100 validation batches for 100 epochs instead of x validation batches, where x could be less or more than the specified number of epochs? Alternatively: If you have far fewer validation batches than epochs, is the model trained without validation for the rest of the epochs, or do validation sets get reused / reshuffled and reused?
  • Is it important that the training and validation batches have the same batch size (shared divisor of the dividends trainingDataCount and validationDataCount)?

Additional question about use_multiprocessing:

  • Are numpy arrays picklable or do I have to convert them to multidimensional lists?

Solution

  • The validation generator works exactly like the training generator. You define how many batches it will yield per epoch.

    • The training generator will yield steps_per_epoch batches.
    • When the epoch ends, the validation generator will yield validation_steps batches.

    But validation data has no relation to training data. There is no need to separate validation batches according to training batches (I would even say that there is no point in doing that, unless you have a very specific intention). Also, the total number of samples in the training data is not related to the total number of samples in the validation data.

    The point of having many batches is just to spare your computer's memory, so you test smaller packs one at a time. Typically, you find a batch size that fits your memory or expected training time and use that size.

    That said, Keras leaves this entirely up to you, so you can determine the training and the validation batches as you wish.

    Epochs:

    Ideally, you use all your validation data at once. If you use only part of your validation data, you will get different metrics for each batch, which may make you think that your model got worse or better when it actually didn't; you just measured different validation sets.

    That's why they suggest validation_steps = total_validation_samples // validation_batch_size.
    Theoretically, you test on your entire validation data every epoch, just as you theoretically should also train on your entire training data every epoch.
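
To illustrate why validating on a single batch gives a noisy number, here is a minimal sketch in plain Python. The labels and predictions are made up for illustration (they do not come from a real model); the point is only that per-batch accuracy can differ a lot from accuracy over the full validation set.

```python
# Hypothetical labels and predictions for 8 validation samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]


def accuracy(true, pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


batch_size = 4

# Accuracy measured separately on each validation batch.
per_batch = [
    accuracy(y_true[i:i + batch_size], y_pred[i:i + batch_size])
    for i in range(0, len(y_true), batch_size)
]

# Accuracy measured on the whole validation set at once.
overall = accuracy(y_true, y_pred)

print(per_batch, overall)
```

The model is the same in both measurements; only the slice of validation data changes, which is why comparing single batches across epochs can be misleading.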

    So, theoretically, each epoch yields:

    • steps_per_epoch = TotalTrainingSamples / TrainingBatchSize
    • validation_steps = TotalValidationSamples / ValidationBatchSize

    Basically, these two parameters say how many batches each generator will yield per epoch.
    This makes sure that at each epoch:

    • You train exactly your entire training set
    • You validate exactly your entire validation set
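
As a rough sketch of the arithmetic above, the two step counts can be computed like this. The sample counts and batch sizes are hypothetical. Note that floor division (//) drops a final partial batch, while rounding up includes it; which one is right depends on whether your generator actually yields that smaller last batch.

```python
import math


def steps_floor(total_samples, batch_size):
    """Number of full batches only; leftover samples are skipped each epoch."""
    return total_samples // batch_size


def steps_ceil(total_samples, batch_size):
    """Number of batches including a final, smaller batch if one exists."""
    return math.ceil(total_samples / batch_size)


# Hypothetical dataset sizes, chosen only for illustration.
total_training_samples = 1000
training_batch_size = 32
total_validation_samples = 250
validation_batch_size = 50

steps_per_epoch = steps_ceil(total_training_samples, training_batch_size)
validation_steps = steps_ceil(total_validation_samples, validation_batch_size)

print(steps_per_epoch, validation_steps)
```

The training and validation batch sizes are independent here, which matches the point that the two generators do not need to share a batch size.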

    Nevertheless, it's totally up to you how you separate your training and validation data.

    If you do want to have one different batch per epoch (epochs using less than your entire data), that's fine: just pass steps_per_epoch=1 or validation_steps=1, for instance. The generator is not reset after each epoch, so the second epoch will take the second batch, and so on, until it loops back to the first batch.

    I prefer training on the entire dataset every epoch, and if an epoch takes too long, I use a callback that shows the logs at the end of each batch:
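
The behaviour described above can be mimicked without Keras. This is only a simulation under the assumption that one batch is pulled from the same generator object at the end of every epoch; batch_generator and the toy data are hypothetical names introduced for the example.

```python
def batch_generator(samples, batch_size):
    """Loop over `samples` forever, yielding one batch at a time."""
    while True:
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]


# Six toy validation samples, split into three batches of two.
val_gen = batch_generator(list(range(6)), batch_size=2)

validation_steps = 1
seen = []
for epoch in range(4):
    # Mimic the end of an epoch: pull `validation_steps` batches from the
    # same generator object, which keeps its position between epochs.
    batches = [next(val_gen) for _ in range(validation_steps)]
    seen.append(batches)

print(seen)
```

Each epoch sees the next batch, and the fourth epoch wraps around to the first batch again.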

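    from keras.callbacks import LambdaCallback

    # Print the logs (loss and metrics so far) at the end of every batch.
    # Pass this list to fit_generator via its callbacks argument.
    callbacks = [LambdaCallback(on_batch_end=lambda batch, logs: print(logs))]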
    

    Multiprocessing

    I was never able to get use_multiprocessing=True to work; it freezes at the start of the first epoch.

    I've noticed that workers is related to how many batches are preloaded from the generator. If you define max_queue_size=1, you will have exactly as many batches preloaded as there are workers.

    They suggest you use Keras Sequences when multiprocessing. A Sequence works pretty much like a generator, but it keeps track of the order/position of each batch.
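
A minimal sketch of such a Sequence, assuming a Keras version that provides keras.utils.Sequence (the import path can differ between versions, and very old builds may not have it at all). ListBatchSequence is a hypothetical class written for this example; plain lists stand in for the NumPy arrays you would normally use, and the fallback to object exists only so the sketch runs without Keras installed.

```python
import math

try:
    # Assumed import path; adjust it for your Keras/TensorFlow version.
    from keras.utils import Sequence
except Exception:
    # Fallback so the sketch still runs without Keras installed.
    Sequence = object


class ListBatchSequence(Sequence):
    """Hypothetical Sequence serving (inputs, targets) batches from two lists."""

    def __init__(self, inputs, targets, batch_size):
        super().__init__()
        self.inputs = inputs
        self.targets = targets
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch, counting a final partial batch.
        return math.ceil(len(self.inputs) / self.batch_size)

    def __getitem__(self, index):
        # Batch `index` is always the same slice, regardless of call order.
        start = index * self.batch_size
        end = start + self.batch_size
        return self.inputs[start:end], self.targets[start:end]


# Ten toy samples with alternating labels, in batches of four.
seq = ListBatchSequence(list(range(10)), [x % 2 for x in range(10)], batch_size=4)

print(len(seq), seq[0], seq[2])
```

Because every batch is addressed by its index instead of by the generator's internal position, several workers can fetch different batches without yielding the same one twice.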