```python
ax.scatter(x=NewUFOTimes['one'], y=NewUFOTimes['three'], s=NewUFOTimes['two']/10)
```
[1]: https://i.sstatic.net/oNb9r.png
How do I make a linear regression with the larger dots having more weight?
For this kind of work in Python, the `numpy` module is a must-have. I infer that you are working with `pandas`; if so, you'll already have `numpy` installed as a dependency, so I think we can assume that the following will work:
```python
x = NewUFOTimes["one"].values
y = NewUFOTimes["three"].values
w = NewUFOTimes["two"].values
```
This stores the underlying `numpy` arrays in `x`, `y`, and `w`. I'm also going to assume `ax` is a `matplotlib` `Axes` object.
Next we'll use `numpy`'s ability to fit a polynomial, in this case a first-degree polynomial (a linear fit). Out of the box, this fit supports optional weighting; you can read the full documentation here. Here's what that looks like:
```python
from numpy.polynomial import Polynomial

... # code that produces your original plot

x = NewUFOTimes["one"].values
y = NewUFOTimes["three"].values
w = NewUFOTimes["two"].values

# Weighted first-degree (linear) fit.
line = Polynomial.fit(x, y, 1, w=w)

# linspace() evaluates the fitted polynomial over its domain,
# returning x- and y-arrays ready for plotting.
fit_x, fit_y = line.linspace()
ax.plot(fit_x, fit_y, '-', label="Weighted fit")
```
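One detail worth knowing: `Polynomial.fit` returns a polynomial expressed in a rescaled internal domain, so its raw coefficients are not in your data's units. Calling `.convert()` maps it back. A minimal sketch, using made-up numbers standing in for your DataFrame columns:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data standing in for NewUFOTimes['one'/'three'/'two'].
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])   # roughly y = 2x + 1
w = np.array([1.0, 2.0, 1.0, 2.0, 1.0])

line = Polynomial.fit(x, y, 1, w=w)

# convert() re-expresses the polynomial in the original x units;
# coef is ordered lowest degree first: [intercept, slope].
intercept, slope = line.convert().coef
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

This is handy if you want to report the slope and intercept rather than just draw the line.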
Now, I don't know where your weights came from, so I don't know whether they obey the `numpy` documentation's suggestion that

> Ideally the weights are chosen so that the errors of the products `w[i]*y[i]` all have the same variance.
Perhaps you are using inverse-variance weighting. The documentation states:

> When using inverse-variance weighting, use `w[i] = 1/sigma(y[i])`.
It's important to be thoughtful about how you introduce weights into the fit. Ordinary linear regression assumes homoscedasticity (i.e. all points weighted equally), so you should be sure your introduction of weights is well-motivated and well-executed.
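To see why the weights matter, here's a small synthetic sketch (the data and weights are invented for illustration) comparing an unweighted fit with one where a single outlier is down-weighted:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Invented data: y = 2x + 1, except the last point is an outlier.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 20.0])
w = np.array([1.0, 1.0, 1.0, 1.0, 0.1])   # down-weight the outlier

# convert().coef gives [intercept, slope] in the original x units.
unweighted = Polynomial.fit(x, y, 1).convert().coef
weighted = Polynomial.fit(x, y, 1, w=w).convert().coef

# The outlier drags the unweighted slope well above 2; down-weighting
# it pulls the weighted slope back toward the true value.
print(f"unweighted slope={unweighted[1]:.2f}, weighted slope={weighted[1]:.2f}")
```

Whether down-weighting a point like this is legitimate depends entirely on where your weights come from, which is the point of the caution above.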
As a quick aside, the 'old' way to do this was to use `numpy.polyfit`. However, the docstring for that function now includes the proviso

> The `Polynomial.fit` class method is recommended for new code as it is more stable numerically. See the documentation of the method for more information.
It is worth noting that there is a trade-off: you may still wish to use the old method if it is important to you to receive the covariance matrix associated with the fit (the `cov=True` option of `polyfit`). If that's not important, then the approach above is likely best. I should add: the use of weights in the fit likely prevents the covariance matrix from being useful anyway!
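For completeness, here's a minimal sketch of that older route (again with invented data, and without weights, to sidestep the covariance caveat just mentioned) showing how `cov=True` returns the covariance matrix alongside the coefficients:

```python
import numpy as np

# Invented data, roughly y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

# polyfit orders coefficients highest degree first: [slope, intercept].
coeffs, cov = np.polyfit(x, y, 1, cov=True)

# The diagonal of the covariance matrix gives the variance of each
# fitted coefficient; its square root is a 1-sigma uncertainty.
slope_err, intercept_err = np.sqrt(np.diag(cov))
print(f"slope = {coeffs[0]:.3f} +/- {slope_err:.3f}")
```

Note that `polyfit`'s default scaling of the covariance assumes unweighted, equal-variance residuals, which is another reason to be careful combining `cov=True` with weights.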