I'm performing linear regression with Keras. My dataset consists of 50 one-dimensional input points and 50 one-dimensional output points. To perform the regression, I'm training a neural network with a single layer containing a single neuron and no activation function. The network is defined as
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(1, input_dim=1, kernel_initializer='zeros',
                bias_initializer='zeros'))
and I ask Keras to find the optimal values of w and b, using SGD as the optimizer and mean squared error as the loss function.
from keras.optimizers import SGD

model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01))
model.fit(x, y, epochs=100, callbacks=[history], verbose=0, batch_size=50)
where history is a callback I created to save the current weight and bias at each step of the optimization.
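A minimal sketch of such a callback (the class name WeightHistory and the exact bookkeeping are illustrative, not the actual implementation), reading the weight and bias from the first layer with get_weights(), could look like this:

from keras.callbacks import Callback

class WeightHistory(Callback):
    # Illustrative callback: records the current weight and bias of the
    # single Dense layer at the end of every batch (i.e. every SGD step).
    def on_train_begin(self, logs=None):
        self.weights = []

    def on_batch_end(self, batch, logs=None):
        w, b = self.model.layers[0].get_weights()
        self.weights.append((w[0, 0], b[0]))

history = WeightHistory()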
I then plot the level curves of the loss function, together with the optimization trajectory, in the (w, b) plane. The output is the following.
The optimization trajectory is shown as red circles, and the global optimum is shown as a blue 'x'. This seems reasonable, since we start at [0, 0] and approach the global optimum after each iteration, until eventually the gradient becomes so small that we stop improving.
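(For reference, a rough sketch of how such a contour-plus-trajectory plot can be produced, assuming x and y are 1-D NumPy arrays of length 50 and reusing the history callback above; the actual code is in the gist linked in EDIT 2.)

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the MSE loss on a grid of (w, b) values (ranges are illustrative).
ws = np.linspace(-1, 3, 200)
bs = np.linspace(-1, 3, 200)
W, B = np.meshgrid(ws, bs)
loss = np.mean((W[..., None] * x + B[..., None] - y) ** 2, axis=-1)

plt.contour(W, B, loss, levels=30)

# Trajectory recorded by the callback, plus the closed-form optimum.
traj = np.array(history.weights)
plt.plot(traj[:, 0], traj[:, 1], 'ro-')
w_opt, b_opt = np.polyfit(x, y, 1)  # least-squares slope and intercept
plt.plot(w_opt, b_opt, 'bx')
plt.xlabel('w')
plt.ylabel('b')
plt.show()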
However, I understand that gradient descent always moves in the direction of the negative gradient at the current point, i.e. perpendicular to the level curves. This optimization trajectory doesn't seem to behave like that. Is Keras' SGD optimizer doing something else under the hood, or am I missing something?
EDIT: Although the plot seems to show that the level curves are parallel lines, they are actually ellipses, just very elongated ones. Choosing a different plotting range reveals this.
EDIT 2: To avoid any confusion about how the image shown in this question was produced, I have now created a gist with the code.
The trajectory is in fact orthogonal to the level curves (the slopes are 0.2 and -5, whose product is -1), but the x and y units of your graph aren't the same. Scaling the plot in one direction doesn't preserve orthogonality.
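To illustrate (continuing from the plotting sketch above, so the names here are the same illustrative ones): forcing equal axis scaling makes the steps look perpendicular to the contours, and the slopes can be checked numerically.

# With equal scaling on both axes, the SGD steps appear perpendicular
# to the contours.
plt.gca().set_aspect('equal')

# Numeric check: a direction of slope 0.2 dotted with a direction of
# slope -5 gives 1*1 + 0.2*(-5) = 0, i.e. the two are orthogonal.
step = np.array([1.0, 0.2])
level = np.array([1.0, -5.0])
print(np.dot(step, level))  # 0.0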