
Linear regression with weighted points in Python


ax.scatter(x=NewUFOTimes['one'], y=NewUFOTimes['three'], s=NewUFOTimes['two'] / 10)

![scatter plot of the data](https://i.sstatic.net/oNb9r.png)

How do I make a linear regression with the larger dots having more weight?


Solution

  • For this kind of work in Python, NumPy is a must-have.

    I infer that you are working with pandas; if so, NumPy is already installed as a dependency of pandas, so I think we can assume that the following will work:

    x = NewUFOTimes["one"].values
    y = NewUFOTimes["three"].values
    w = NewUFOTimes["two"].values
    

    This stores the underlying numpy arrays in x, y, and w (recent pandas recommends .to_numpy() as the accessor, but .values works here). I'll also assume ax is a matplotlib Axes object.

    Next we will use numpy's ability to fit a polynomial, in this case a first-degree polynomial (a linear fit). Out of the box, this fit supports optional weighting; see the documentation for numpy.polynomial.Polynomial.fit for details.

    Here's what that looks like:

    from numpy.polynomial import Polynomial
    
    ... #  code that produces your original plot
    
    x = NewUFOTimes["one"].values
    y = NewUFOTimes["three"].values
    w = NewUFOTimes["two"].values
    
    # Degree-1 (linear) fit, weighted by the point sizes
    line = Polynomial.fit(x, y, 1, w=w)
    
    # Sample the fitted line at evenly spaced points across its domain
    fit_x, fit_y = line.linspace()
    
    ax.plot(fit_x, fit_y, '-', label="Weighted fit")
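Once you have the fitted Polynomial, you can also recover the slope and intercept directly. Here is a self-contained sketch with made-up numbers standing in for the NewUFOTimes columns (the data and weights are purely illustrative):

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up data standing in for the NewUFOTimes columns
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
w = np.array([1.0, 1.0, 5.0, 1.0, 1.0])  # third point weighted heavily

line = Polynomial.fit(x, y, 1, w=w)

# fit() works in a scaled internal domain; convert() maps the
# coefficients back to the original x units, lowest degree first
intercept, slope = line.convert().coef
print(intercept, slope)
```

Note that numpy applies the weights to the unsquared residuals, so a weight of 5 counts like 25 in the least-squares sum.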
    

    Now, I don't know where your weights came from, so I don't know whether they satisfy the numpy documentation's suggestion that

    Ideally the weights are chosen so that the errors of the products w[i]*y[i] all have the same variance.

    Perhaps you are using inverse-variance weighting. The documentation states:

    When using inverse-variance weighting, use w[i] = 1/sigma(y[i]).

    It's important to be thoughtful about how you introduce weights into the fit. Ordinary linear regression assumes homoscedasticity (equal error variance across observations, i.e. all weights equal), so you should be sure your introduction of weights is well-motivated and well-executed.
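To make the inverse-sigma case concrete, here is a sketch using simulated data (not your UFO columns): the second half of the measurements is much noisier, and weighting each point by 1/sigma lets the precise points dominate the fit.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Simulated heteroscedastic data: true line y = 3x + 1,
# with much noisier measurements for x >= 5
x = np.linspace(0, 10, 50)
sigma = np.where(x < 5, 0.1, 2.0)
y = 3.0 * x + 1.0 + rng.normal(0.0, sigma)

# Per the numpy docs, weight each point by 1/sigma
weighted = Polynomial.fit(x, y, 1, w=1.0 / sigma)
unweighted = Polynomial.fit(x, y, 1)

print(weighted.convert().coef)    # close to the true [1, 3]
print(unweighted.convert().coef)  # compare against the weighted fit
```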


    As a quick aside, the 'old' way to do this was to use numpy.polyfit. However, the docstring for that function now includes the proviso

    The Polynomial.fit class method is recommended for new code as it is more stable numerically. See the documentation of the method for more information.

    It is worth noting that there is a trade-off. You may still wish to use the old function if it is important to you to obtain the covariance matrix associated with the fit (the cov=True option of polyfit), which Polynomial.fit does not return. If that's not important, then the above approach is likely best. I should add: using weights in the fit likely prevents the covariance matrix from being useful anyway, unless the weights really are inverse standard deviations!
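For completeness, here is what the covariance route looks like. This is a sketch with made-up data, not your UFO columns; with cov=True, polyfit returns the covariance matrix of the coefficients, whose diagonal gives their variances:

```python
import numpy as np

# Made-up data roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0])

# polyfit returns coefficients highest degree first
coeffs, cov = np.polyfit(x, y, 1, cov=True)
slope, intercept = coeffs

# One-sigma uncertainties from the diagonal of the covariance matrix
slope_err, intercept_err = np.sqrt(np.diag(cov))
print(f"slope = {slope:.2f} +/- {slope_err:.2f}")
print(f"intercept = {intercept:.2f} +/- {intercept_err:.2f}")
```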