In R, I generated some artificial data in order to fit a linear regression with the gradient descent method: Y = c0 + c1 * x1 + c2 * x2 + noise.
I also use an analytical method to calculate the parameters theta = [c0, c1, c2]. Below is the R code with notes on the variables.
I use the gradient descent method to calculate theta. The formulae are taken from the link below.
slides from Stanford - Andrew Ng
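For reference, the update rule from those slides, written in the notation of my code below (theta is a 1 x np row vector, X is the N x np design matrix whose first column is all ones):

cost:    J(theta) = (1/(2*N)) * sum((theta %*% t(X) - t(Y))^2)
update:  theta <- theta - eta * (1/N) * (theta %*% t(X) - t(Y)) %*% X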
However, the method is unable to converge: in the code below, theta ends up very different from the analytical solution k.
rm(list = ls())
n=500
x1=rnorm(n,mean=4,sd=1.6)
x2=rnorm(n,mean=4,sd=2.5)
X=cbind(x1,x2)
A=as.matrix(cbind(rep(1,n),x1,x2))
Y=-3.9+3.8*x1-2.4*x2+rnorm(n,mean=0,sd=1.5);
k=solve(t(A)%*%A,t(A)%*%Y) # k is the parameters determined by analytical method
MSE=sum((A%*%k-Y)^2)/(n);
iterations=3000 # total number of steps
epsilon = 0.0001 # set precision
eta=0.0001 # step size
t1=numeric(iterations) # stores the L2 norm of each update step
e1=numeric(iterations) # stores the cost at each step
X=as.matrix(X)# convert data table X into a matrix
N=dim(X)[1] # total number of observations
X=as.matrix(cbind(rep(1,N),X))# add a column of ones to represent the intercept
np=dim(X)[2] # number of parameters to be determined
theta=matrix(rnorm(n=np,mean=0,sd=1),1,np) # Initialize theta:1 x np matrix
for(i in 1:iterations){
error =theta%*%t(X)-t(Y) # error = (theta * x' -Y'). Error is a 1xN row vector;
grad=(1/N)*error%*%X # Gradient grad is 1 x np vector
theta=theta-eta*grad # updating theta
L=sqrt(sum((eta*grad)^2)) # calculating the L2 norm
e1[i]=sum((error)^2)/(2*N) # record the cost at each step (this is MSE/2, hence the factor of 2 in the plot below)
t1[i]=L # record the L2 norm in each step
if(L<=epsilon){ # checking whether convergence is obtained or not
break
}
}
plot(e1*2,type="l",ylab="MSE",lwd=2,col=rgb(0,0,1))
abline(h=MSE)
text(x=1000,y=MSE+1,labels = "Actual MSE",adj=1)
text(x=500,y=15,labels = "Gradient Descent",adj=0.4)
theta
k
Your problem is that you set your eta very conservatively. Hence it takes a very long time to converge. Eta is crucial for convergence speed: if it is chosen too large, however, the algorithm might not converge at all. You will likely hear later in the course about algorithms which automatically adjust eta, such as Adagrad or Adam; a rough sketch of that idea follows.
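Just to illustrate what "automatically adjusting eta" means, here is a rough Adagrad-style sketch (illustration only, reusing X, Y, N, np and iterations from your question; eta0, G and theta_ada are names I made up):

# Adagrad-style update: each parameter gets its own step size,
# which shrinks as its squared gradients accumulate.
eta0      = 0.5                  # base step size (arbitrary choice for this sketch)
G         = matrix(0, 1, np)     # accumulated squared gradients, 1 x np
theta_ada = matrix(0, 1, np)     # start from zeros
for (i in 1:iterations) {
  error     = theta_ada %*% t(X) - t(Y)                    # 1 x N residuals
  grad      = (1/N) * error %*% X                          # 1 x np gradient
  G         = G + grad^2                                   # accumulate squared gradients
  theta_ada = theta_ada - eta0 * grad / (sqrt(G) + 1e-8)   # per-parameter step
}
theta_ada # compare with the analytical solution k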
For this exercise, though, it is enough to choose a larger eta, for example eta = 0.001, eta = 0.01, or eta = 0.1, rerun your loop, and compare the resulting MSE curves; one such run is sketched below.
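To see the difference, rerun the exact loop from your question with a larger step size and compare the result with the analytical solution k (this reuses X, Y, N, np, iterations and epsilon from your code; only eta changes):

eta   = 0.01                                     # larger step size
theta = matrix(rnorm(n=np, mean=0, sd=1), 1, np) # re-initialize theta
for (i in 1:iterations) {
  error = theta %*% t(X) - t(Y)                  # 1 x N residuals
  grad  = (1/N) * error %*% X                    # 1 x np gradient
  theta = theta - eta * grad                     # gradient step
  if (sqrt(sum((eta * grad)^2)) <= epsilon) break # same stopping rule as before
}
theta # should now be much closer to k
t(k)  # analytical solution, transposed for comparison

With eta = 0.0001, 3000 iterations are simply far too few steps, which is why theta still looked nothing like k.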