In R, I generated some artificial data in order to fit a linear regression with the gradient descent method: Y = c0 + c1 * x1 + c2 * x2 + noise.
I also use an analytical method to calculate the parameters theta = [c0, c1, c2]. Below is the R code with notes on the variables.
I use the gradient descent method to calculate theta. The formulae are taken from the link below.
slides from Stanford - Andrew Ng
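For reference, the update rule from those slides, written in the notation of my code below (theta is a 1 x np row vector, X is the N x np design matrix whose first column is all ones):

cost:    J(theta) = (1/(2*N)) * sum((theta %*% t(X) - t(Y))^2)
update:  theta <- theta - eta * (1/N) * (theta %*% t(X) - t(Y)) %*% X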
However, the method is unable to converge: in the code below, theta ends up very different from the analytical solution k.
rm(list = ls())
n=500
x1=rnorm(n,mean=4,sd=1.6)
x2=rnorm(n,mean=4,sd=2.5)
X=cbind(x1,x2)
A=as.matrix(cbind(rep(1,n),x1,x2))
Y=-3.9+3.8*x1-2.4*x2+rnorm(n,mean=0,sd=1.5);
k=solve(t(A)%*%A,t(A)%*%Y) # k is the parameters determined by analytical method
MSE=sum((A%*%k-Y)^2)/(n);
iterations=3000 # total number of steps
epsilon = 0.0001 # set precision
eta=0.0001 # step size
t1=numeric(iterations) # stores the L2 norm of each update step
e1=numeric(iterations) # stores the cost at each step
X=as.matrix(X)# convert data table X into a matrix
N=dim(X)[1] # total number of observations
X=as.matrix(cbind(rep(1,N),X))# add a column of ones to represent the intercept
np=dim(X)[2] # number of parameters to be determined
theta=matrix(rnorm(n=np,mean=0,sd=1),1,np) # Initialize theta:1 x np matrix
for(i in 1:iterations){
error =theta%*%t(X)-t(Y) # error = (theta * x' -Y'). Error is a 1xN row vector;
grad=(1/N)*error%*%X # Gradient grad is 1 x np vector
theta=theta-eta*grad # updating theta
L=sqrt(sum((eta*grad)^2)) # calculating the L2 norm
e1[i]=sum((error)^2)/(2*N) # record the cost at each step (this is MSE/2, hence the factor of 2 in the plot below)
t1[i]=L # record the L2 norm in each step
if(L<=epsilon){ # checking whether convergence is obtained or not
break
}
}
plot(e1*2,type="l",ylab="MSE",lwd=2,col=rgb(0,0,1))
abline(h=MSE)
text(x=1000,y=MSE+1,labels = "Actual MSE",adj=1)
text(x=500,y=15,labels = "Gradient Descent",adj=0.4)
theta
k
Your problem is that you set your eta very conservatively. Hence it takes a very long time to converge. Eta is crucial for convergence speed: if it is chosen too large, however, the algorithm might not converge at all. You will likely hear later in the course about algorithms which automatically adjust eta, such as Adagrad or Adam; a rough sketch of that idea follows.
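Just to illustrate what "automatically adjusting eta" means, here is a rough Adagrad-style sketch (illustration only, reusing X, Y, N, np and iterations from your question; eta0, G and theta_ada are names I made up):

# Adagrad-style update: each parameter gets its own step size,
# which shrinks as its squared gradients accumulate.
eta0      = 0.5                  # base step size (arbitrary choice for this sketch)
G         = matrix(0, 1, np)     # accumulated squared gradients, 1 x np
theta_ada = matrix(0, 1, np)     # start from zeros
for (i in 1:iterations) {
  error     = theta_ada %*% t(X) - t(Y)                    # 1 x N residuals
  grad      = (1/N) * error %*% X                          # 1 x np gradient
  G         = G + grad^2                                   # accumulate squared gradients
  theta_ada = theta_ada - eta0 * grad / (sqrt(G) + 1e-8)   # per-parameter step
}
theta_ada # compare with the analytical solution k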
For this exercise, though, it is enough to choose a larger eta, for example eta = 0.001, eta = 0.01, or eta = 0.1, rerun your loop, and compare the resulting MSE curves; one such run is sketched below.
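To see the difference, rerun the exact loop from your question with a larger step size and compare the result with the analytical solution k (this reuses X, Y, N, np, iterations and epsilon from your code; only eta changes):

eta   = 0.01                                     # larger step size
theta = matrix(rnorm(n=np, mean=0, sd=1), 1, np) # re-initialize theta
for (i in 1:iterations) {
  error = theta %*% t(X) - t(Y)                  # 1 x N residuals
  grad  = (1/N) * error %*% X                    # 1 x np gradient
  theta = theta - eta * grad                     # gradient step
  if (sqrt(sum((eta * grad)^2)) <= epsilon) break # same stopping rule as before
}
theta # should now be much closer to k
t(k)  # analytical solution, transposed for comparison

With eta = 0.0001, 3000 iterations are simply far too few steps, which is why theta still looked nothing like k.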