I developed a image processing program that identifies what a number is given an image of numbers. Each image was 27x27 pixels = 729 pixels. I take each R, G and B value which means I have 2187 variables from each image (+1 for the intercept = total of 2188).
I used the below gradient descent formula:
Repeat {
θj = θj−α/m∑(hθ(x)−y)xj
}
Where θj
is the coefficient on variable j; α
is the learning rate; hθ(x)
is the hypothesis; y
is real value and xj
is the value of variable j. m
is the number of training sets. hθ(x)
, y
are for each training set (i.e. that's what the summation sign is for). Further the hypothesis is defined as:
hθ(x) = 1/(1+ e^-z)
z= θo + θ1X1+θ2X2 +θ3X3...θnXn
With this, and 3000 training images, I was able to train my program in just over an hour and when tested on a cross validation set, it was able to identify the correct image ~ 67% of the time.
I wanted to improve that so I decided to attempt a polynomial of degree 2.
However the number of variables jumps from 2188 to 2,394,766 per image! It takes me an hour just to do 1 step of gradient descent.
So my question is, how is this vast number of variables handled in machine learning? On the one hand, I don't have enough space to even hold that many variables for each training set. On the other hand, I am currently storing 2188 variables per training sample, but I have to perform O(n^2) just to get the values of each variable multiplied by another variable (i.e. the polynomial to degree 2 values).
So any suggestions / advice is greatly appreciated.
try to use some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images)
vectorize your gradient descent - with most math libraries or in matlab etc. it will run much faster
parallelize the algorithm and then run in on multiple CPUs (but maybe your library for multiplying vectors already supports parallel computations)