Tags: c++, algorithm, matlab, statistics, mathematical-optimization

Weighted linear least squares for 2D data point sets


My question is an extension of the discussion "How to fit the 2D scatter data with a line with C++". When estimating the line that fits 2D scatter data, it would be better to treat each data point differently: if a point is far away from the line, we give it a low weight, and vice versa. The question therefore becomes: given an array of 2D scatter points and their weighting factors, how can we estimate the line that best fits them?

A good implementation of this method can be found in this article (weighted least regression). However, the implementation in that article is too complicated for my purposes because it involves matrix calculation, so I am trying to find a method that avoids it. The algorithm is an extension of simple linear regression, and to illustrate it I wrote the following MATLAB code:

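function line = weighted_least_squre_for_line(x,y,weighting)
% Weighted least-squares fit of the line y = beta*x + alpha.
% x, y, weighting: vectors of the same size.
% Returns line = [a b c] such that a*x + b*y + c = 0.

% Numerator of the slope: sum(w*x*y)*sum(w) - sum(w*x)*sum(w*y)
part1 = sum(weighting.*x.*y)*sum(weighting(:));
part2 = sum(weighting.*x)*sum(weighting.*y);

% Denominator of the slope: sum(w*x^2)*sum(w) - sum(w*x)^2
part3 = sum(x.^2.*weighting)*sum(weighting(:));
part4 = sum(weighting.*x).^2;

beta = (part1-part2)/(part3-part4);   % slope

alpha = (sum(weighting.*y)-beta*sum(weighting.*x))/sum(weighting(:));   % intercept

% Rewrite y = beta*x + alpha as a*x + b*y + c = 0.
a = beta;
c = alpha;
b = -1;
line = [a b c];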

In the code above, x, y and weighting are the x-coordinates, y-coordinates and weighting factors of the points, respectively. I have tested the algorithm with several examples, but I am still not sure whether it is correct, because it gives a different result from polyfit, which relies on matrix calculation. I am posting the implementation here for your advice. Do you think it is a correct implementation? Thanks!
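Since the original discussion was about C++, here is a minimal sketch of the same closed-form formulas in C++, with no matrix operations. The names `Line` and `weightedLineFit` are made up for this illustration, and the sketch assumes non-negative weights and at least two distinct x values with non-zero weight. Note also that polyfit fits without weights, so the two results should only be expected to agree when all weights are equal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Coefficients of the fitted line y = slope * x + intercept.
// (Hypothetical helper type for this sketch.)
struct Line {
    double slope;
    double intercept;
};

// Weighted least-squares line fit: minimises
//   sum_i w[i] * (y[i] - slope * x[i] - intercept)^2
// using only running sums, mirroring part1..part4 in the MATLAB code.
// Assumes x, y and w have the same length and that the weighted
// x values are not all identical (otherwise the denominator is zero).
Line weightedLineFit(const std::vector<double>& x,
                     const std::vector<double>& y,
                     const std::vector<double>& w) {
    assert(x.size() == y.size() && x.size() == w.size());

    double sw = 0.0, swx = 0.0, swy = 0.0, swxx = 0.0, swxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sw   += w[i];
        swx  += w[i] * x[i];
        swy  += w[i] * y[i];
        swxx += w[i] * x[i] * x[i];
        swxy += w[i] * x[i] * y[i];
    }

    const double denom = sw * swxx - swx * swx;  // part3 - part4
    assert(denom != 0.0);

    const double slope = (sw * swxy - swx * swy) / denom;  // (part1 - part2) / denom
    const double intercept = (swy - slope * swx) / sw;
    return {slope, intercept};
}
```

Calling `weightedLineFit(x, y, w)` with every weight set to 1 reduces to ordinary simple linear regression, which is a convenient way to compare this against an unweighted fit.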


Solution

  • If you think it is a good idea to downweight points that are far from the line, you might be interested in least absolute deviations (http://en.wikipedia.org/wiki/Least_absolute_deviations), because one way of computing such a fit is iteratively re-weighted least squares (http://en.wikipedia.org/wiki/Iteratively_re-weighted_least_squares), which gives less weight to points far from the line.
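
To make the suggestion concrete, here is a rough C++ sketch of iteratively re-weighted least squares used to approximate a least-absolute-deviations line. It is only an illustration under assumptions of my own: the names `FittedLine`, `weightedFitCentered` and `ladLineFitIrls` are hypothetical, each weight is taken as 1 / max(|residual|, eps), and the iteration count and `eps` are arbitrary defaults rather than tuned values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Coefficients of the fitted line y = slope * x + intercept.
// (Hypothetical helper type for this sketch.)
struct FittedLine {
    double slope;
    double intercept;
};

// Weighted least-squares line fit without matrices. It is written around
// the weighted means, which is algebraically the same as the sum-based
// formula but behaves better when some weights are very large.
FittedLine weightedFitCentered(const std::vector<double>& x,
                               const std::vector<double>& y,
                               const std::vector<double>& w) {
    assert(x.size() == y.size() && x.size() == w.size());
    double sw = 0.0, swx = 0.0, swy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sw  += w[i];
        swx += w[i] * x[i];
        swy += w[i] * y[i];
    }
    const double xbar = swx / sw;  // weighted mean of x
    const double ybar = swy / sw;  // weighted mean of y

    double sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - xbar;
        sxx += w[i] * dx * dx;
        sxy += w[i] * dx * (y[i] - ybar);
    }
    assert(sxx != 0.0);  // needs at least two distinct weighted x values
    const double slope = sxy / sxx;
    return {slope, ybar - slope * xbar};
}

// IRLS approximation of the least-absolute-deviations line.
// Start from the unweighted fit, then repeatedly refit with weights
// 1 / max(|residual|, eps), so points far from the current line count less.
// The defaults for iterations and eps are assumptions made for this sketch.
FittedLine ladLineFitIrls(const std::vector<double>& x,
                          const std::vector<double>& y,
                          int iterations = 100,
                          double eps = 1e-8) {
    std::vector<double> w(x.size(), 1.0);
    FittedLine fit = weightedFitCentered(x, y, w);
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i < x.size(); ++i) {
            const double r = y[i] - (fit.slope * x[i] + fit.intercept);
            w[i] = 1.0 / std::max(std::fabs(r), eps);
        }
        fit = weightedFitCentered(x, y, w);
    }
    return fit;
}
```

For example, with four points on y = 2x + 1 and one far outlier, `ladLineFitIrls(x, y)` should end up close to slope 2 and intercept 1, whereas the plain unweighted fit is pulled strongly towards the outlier.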