I am working on an algorithm to perform linear regression with one or more independent variables.
That is, if I have m real-world observations and, in the case of two independent variables, a and b:
C + D*a1 + E*b1 = y1
C + D*a2 + E*b2 = y2
...
C + D*am + E*bm = ym
I would like to use the least-squares solution to find the best-fitting line.
I will be using matrix notation, writing the system as X*Beta = y, where Beta is the vector [C, D, E] whose entries define the best-fit line.
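Spelled out (my reconstruction of the notation above; beta is the same Beta = [C, D, E]), the overdetermined system and its least-squares solution are:

    X = \begin{pmatrix} 1 & a_1 & b_1 \\ 1 & a_2 & b_2 \\ \vdots & \vdots & \vdots \\ 1 & a_m & b_m \end{pmatrix},
    \quad
    \beta = \begin{pmatrix} C \\ D \\ E \end{pmatrix},
    \quad
    y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix},
    \qquad
    X\beta \approx y,
    \qquad
    \hat\beta = (X^T X)^{-1} X^T y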
Question
What is the best way to solve this formula? Should I compute the inverse of X^TX, or should I use an LU factorization/decomposition of that matrix? What is the performance of each on a large amount of data (i.e. a big value of m, possibly on the order of 10^8)?
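For concreteness, this is roughly how I would form X^TX and X^Ty in a single pass over the data (a sketch only; the names are illustrative):

    #include <cstddef>
    #include <vector>

    // Sketch of the setup (illustrative names, not a fixed API): accumulate the
    // 3x3 matrix X^T*X and the vector X^T*y in one pass over the data for the
    // model y = C + D*a + E*b, without ever storing the full m-by-3 matrix X.
    struct NormalEq {
        double XtX[3][3] = {};  // X^T * X, symmetric 3x3
        double Xty[3]    = {};  // X^T * y
    };

    NormalEq accumulate(const std::vector<double>& a,
                        const std::vector<double>& b,
                        const std::vector<double>& y)
    {
        NormalEq n;
        for (std::size_t i = 0; i < y.size(); ++i) {
            const double row[3] = { 1.0, a[i], b[i] };  // i-th row of X
            for (int r = 0; r < 3; ++r) {
                for (int c = 0; c < 3; ++c)
                    n.XtX[r][c] += row[r] * row[c];
                n.Xty[r] += row[r] * y[i];
            }
        }
        return n;
    }

With only three coefficients the resulting system is 3-by-3, so the single pass over the m observations should dominate the running time regardless of how the small system is then solved.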
EDIT
If the answer is to use a Cholesky decomposition or a QR decomposition, are there any implementation hints or simple libraries to use? I am coding in C/C++.
Your X^TX matrix is symmetric positive definite (assuming X has full column rank), so it has a Cholesky decomposition. I'd look into that decomposition before LU; for such matrices it is roughly twice as fast: http://en.wikipedia.org/wiki/Cholesky_decomposition
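If you want a concrete starting point in C++, here is a minimal sketch using the header-only Eigen library (my suggestion, not something the question mandates); the data below are made up:

    #include <Eigen/Dense>
    #include <iostream>

    // Solve the normal equations (X^T*X) * beta = X^T*y with Eigen's
    // Cholesky (LLT) factorization rather than forming an explicit inverse.
    int main()
    {
        Eigen::MatrixXd X(5, 3);   // rows are [1, a_i, b_i]
        Eigen::VectorXd y(5);
        X << 1, 0, 1,
             1, 1, 0,
             1, 2, 2,
             1, 3, 1,
             1, 4, 3;
        y << 3.0, 2.1, 6.9, 6.1, 10.9;   // roughly y = 1 + 1*a + 2*b

        Eigen::MatrixXd XtX = X.transpose() * X;
        Eigen::VectorXd Xty = X.transpose() * y;

        // llt() computes the Cholesky factor; solve() does the two triangular solves.
        Eigen::VectorXd beta = XtX.llt().solve(Xty);

        std::cout << "C D E = " << beta.transpose() << std::endl;
        return 0;
    }

For your m of around 10^8, accumulate X^TX and X^Ty in a streaming pass and hand only the small 3x3 system to llt(). If the columns of X are close to collinear, solving the least-squares problem with a QR decomposition of X itself (e.g. Eigen's householderQr()) is numerically more stable than the normal equations.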