Tags: java, deep-learning, deeplearning4j

Deeplearning4j: how should data be normalized?


How should my input data be normalized for model.fit in Deeplearning4j?

Currently I iterate over a large amount of data.

I can see that some people normalize over the complete data set.

To me it seems more logical to normalize each iteration's dataset just before model.fit.

Is there a best practice for coding the normalization inside the iterator?

And what about the input for prediction?


Solution

  • You should always normalize with statistics from your training set. If you only normalize each batch on its own, what would you do at inference time, when you have only a single example?

    If you use a normalizer that is based on statistics (i.e. you normalize to zero mean and unit variance; e.g. NormalizerStandardize), then you will have to .fit() it on your DataSetIterator first. This pass goes through all your data and collects the statistics needed to normalize it properly.

    Afterwards (or immediately, for normalizers that don't need to be fit to the data, e.g. a fixed-range scaler as you would use for images), you set the normalizer on your DataSetIterator using .setPreProcessor(normalizer). From that point on, your DataSetIterator will return normalized values.

    When you get to prediction, you use the same normalizer that you used for training and normalize your new input data with it.

    If your normalizer had to be fit on the data, you can use its .save() method to save it, and use its .load() method to load it. For other normalizers, you can just create a new instance.
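The training-side workflow described above can be sketched as follows. This is a minimal, illustrative example: the toy random data and the variable names (trainIter, normalizer) are stand-ins for your real pipeline.

```java
import org.deeplearning4j.datasets.iterator.impl.ListDataSetIterator;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;
import org.nd4j.linalg.factory.Nd4j;

public class NormalizeTrainingData {
    public static void main(String[] args) {
        // Toy data standing in for your real training set:
        // 100 examples, 4 features, 2 labels
        DataSet data = new DataSet(Nd4j.rand(100, 4), Nd4j.rand(100, 2));
        DataSetIterator trainIter = new ListDataSetIterator<>(data.asList(), 10);

        // 1. Fit the normalizer: one pass over the WHOLE training set
        //    to collect per-feature mean and standard deviation
        NormalizerStandardize normalizer = new NormalizerStandardize();
        normalizer.fit(trainIter);

        // 2. Attach it as a pre-processor, so every batch the iterator
        //    returns from now on is standardized
        trainIter.reset();
        trainIter.setPreProcessor(normalizer);

        // From here on, model.fit(trainIter) sees normalized features
        System.out.println(trainIter.next().getFeatures().meanNumber());
    }
}
```

Note that the same normalizer instance holds the collected statistics, which is why it must be kept (or saved) for inference.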
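For the inference side, one way to persist and reuse a fitted normalizer is NormalizerSerializer (the current alternative to the older .save()/.load() methods). A sketch, with the file name and toy data chosen for illustration:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;
import org.nd4j.linalg.dataset.api.preprocessor.serializer.NormalizerSerializer;
import org.nd4j.linalg.factory.Nd4j;

import java.io.File;

public class NormalizeForInference {
    public static void main(String[] args) throws Exception {
        // Stand-in for the normalizer you fit on the training set
        NormalizerStandardize normalizer = new NormalizerStandardize();
        normalizer.fit(new DataSet(Nd4j.rand(100, 4), Nd4j.rand(100, 2)));

        // Persist it alongside the model...
        File f = new File("normalizer.bin");
        NormalizerSerializer.getDefault().write(normalizer, f);

        // ...and restore it at prediction time
        NormalizerStandardize restored = NormalizerSerializer.getDefault().restore(f);

        // Apply the SAME training-set statistics to new input,
        // even when it is a single example
        INDArray newInput = Nd4j.rand(1, 4);
        restored.transform(newInput); // normalizes in place
        // now pass newInput to model.output(newInput)
    }
}
```

This is exactly why batch-wise normalization breaks down at inference: a single example has no meaningful batch statistics, but the restored training-set statistics still apply.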