Tags: machine-learning, tensorflow, linear-regression, estimation, hyperparameters

Is tuning batch size or epochs necessary for linear regression with TensorFlow?


I am working on an article focused on a simple problem: linear regression over a large data set in the presence of standard normal or uniform noise. I chose TensorFlow's Estimator API as the modeling framework.
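
For context, the setup looks roughly like the following minimal sketch (synthetic data, a TF 1.x-style numpy_input_fn, and placeholder constants; not my actual code):

    import numpy as np
    import tensorflow as tf

    # Synthetic data: y = 3x + 2 plus standard normal noise
    rng = np.random.RandomState(0)
    x = rng.uniform(-1.0, 1.0, size=100_000).astype(np.float32)
    y = (3.0 * x + 2.0 + rng.normal(size=x.shape)).astype(np.float32)

    feature_columns = [tf.feature_column.numeric_column('x')]
    estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

    # batch_size and num_epochs are the hyperparameters in question;
    # num_epochs=None repeats the data, so `steps` alone bounds training length.
    train_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
        x={'x': x}, y=y, batch_size=128, num_epochs=None, shuffle=True)

    estimator.train(input_fn=train_input_fn, steps=10_000)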

I am finding that hyperparameter tuning is, in fact, of little importance for such a machine learning problem when the number of training steps can be made sufficiently large. By hyperparameters I mean the batch size and the number of epochs in the training data stream.

Is there any paper or article with a formal proof of this?


Solution

  • I don't think there is a paper specifically focused on this question, because it's a more or less fundamental fact. The introductory chapter of this book discusses the probabilistic interpretation of machine learning in general and loss function optimization in particular.

    In short, the idea is this: mini-batch optimization with respect to (x1, ..., xn) is equivalent to consecutive optimization steps with respect to the individual inputs x1, ..., xn, because the gradient is a linear operator. This means that the mini-batch update equals the sum of the per-example updates. One important caveat: I assume the network doesn't apply batch norm or any other layer that makes the inference for one example depend on the rest of the batch (in that case the math gets a bit hairier).
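
    A tiny NumPy check of this linearity, on made-up toy data and the summed squared-error loss:

        import numpy as np

        # Loss L(w) = 0.5 * sum_i (w^T x_i - y_i)^2; its gradient over a
        # mini-batch is the sum of the per-example gradients, because
        # differentiation is linear.
        rng = np.random.RandomState(0)
        X = rng.normal(size=(8, 3))   # mini-batch of 8 examples, 3 features
        y = rng.normal(size=8)
        w = rng.normal(size=3)

        def grad_single(xi, yi, w):
            return (xi @ w - yi) * xi          # gradient for one example

        batch_grad = X.T @ (X @ w - y)         # gradient of the summed batch loss
        sum_of_grads = sum(grad_single(xi, yi, w) for xi, yi in zip(X, y))

        print(np.allclose(batch_grad, sum_of_grads))   # True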

    So the batch size can be seen as a purely computational choice that speeds up the optimization through vectorization and parallel computing. Assuming one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to almost any value. But this isn't automatically true for all hyperparameters: a very high learning rate, for example, can easily make the optimization diverge, so don't conclude that hyperparameter tuning is unimportant in general.
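
    To see the learning-rate caveat concretely, here is a hypothetical one-dimensional example (plain gradient descent on f(w) = w**2, nothing to do with the Estimator API):

        # Gradient descent on f(w) = w**2, gradient 2*w.
        # The update w <- w - lr * 2*w converges only if |1 - 2*lr| < 1, i.e. lr < 1.
        def run(lr, steps=20, w=1.0):
            for _ in range(steps):
                w -= lr * 2.0 * w
            return w

        print(run(0.4))   # shrinks toward 0: converges
        print(run(1.5))   # |1 - 2*lr| = 2 > 1: magnitude blows up, diverges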