
Dissimilarity between the linear-fit coefficients produced by NumPy's Polynomial and MATLAB's polyfit


The objective is to find the coefficients of a polynomial P(x) of degree 1 (a line) that fits the data y best in a least-squares sense, and then compute the root of P.

In MATLAB, the linear least-squares fit can be calculated via

[p, S, mu] = polyfit(x, y, 1)

This produces the coefficients p = [-1.5810877, 6.0094824], which define the polynomial:

y = p(1) * x + p(2)

mu is a two-element vector containing the mean and standard deviation of x, used for centering and scaling.

In Python, this can be achieved with NumPy's polynomial module:

import numpy as np

x = [
    6210, 6211, 6212, 6213, 6214, 6215, 6216, 6217, 6218, 6219,
    6220, 6221, 6222, 6223, 6224, 6225, 6226, 6227, 6228, 6229,
    6230, 6231, 6232, 6233, 6234, 6235, 6236, 6237, 6238, 6239,
    6240, 6241, 6242, 6243, 6244, 6245, 6246, 6247, 6248, 6249,
    6250, 6251, 6252, 6253, 6254, 6255, 6256, 6257, 6258, 6259,
    6260, 6261, 6262, 6263, 6264, 6265, 6266, 6267, 6268, 6269,
    6270, 6271, 6272, 6273, 6274, 6275, 6276, 6277, 6278, 6279,
    6280, 6281, 6282, 6283, 6284, 6285, 6286, 6287, 6288
]
y = [
    7.8625913, 7.7713094, 7.6833806, 7.5997391, 7.5211883,
    7.4483986, 7.3819046, 7.3221073, 7.2692747, 7.223547,
    7.1849418, 7.1533613, 7.1286001, 7.1103559,  7.0982385,
    7.0917811, 7.0904517, 7.0936642, 7.100791, 7.1111741,
    7.124136, 7.1389918, 7.1550579, 7.1716633, 7.1881566,
    7.2039142, 7.218349, 7.2309117, 7.2410989, 7.248455,
    7.2525721, 7.2530937, 7.249711, 7.2421637, 7.2302341,
    7.213747, 7.1925621, 7.1665707, 7.1356878, 7.0998487,
    7.0590014, 7.0131001, 6.9621005, 6.9059525, 6.8445964,
    6.7779589, 6.7059474, 6.6284504, 6.5453324, 6.4564347,
    6.3615761, 6.2605534, 6.1531439, 6.0391097, 5.9182019,
    5.7901659, 5.6547484, 5.5117044, 5.360805, 5.2018456,
    5.034656, 4.8591075, 4.6751242, 4.4826899, 4.281858,
    4.0727611, 3.8556159, 3.6307325, 3.3985188, 3.1594861,
    2.9142516, 2.6635408, 2.4081881, 2.1491354, 1.8874279,
    1.6242117, 1.3607255, 1.0982931, 0.83831298
]

p = np.polynomial.polynomial.Polynomial.fit(x, y, 1, domain=[-1, 1])
print(p)
436.53467443432453 - 0.0688950539698132·x¹

Both the mean and std can be calculated as follows:

x_mean = np.mean(x)
x_std = np.std(x)

I notice, however, that the coefficients produced by MATLAB's polyfit and NumPy's Polynomial are different:

MATLAB: [-1.5810877, 6.0094824] vs Python: [436.53467443432453, -0.0688950539698132]

The difference is significant for my use case, because I want to compute the root of the line.

In MATLAB, this can be calculated as

>> x_m = p(1);
>> x_mean = mu(1);
>> x_c = p(2);
>> x_std = mu(2);
>> x_intercept = (x_m * x_mean - x_c * x_std) ./ x_m
x_intercept = 
    6336.2266

Whereas in Python

>>> x_mean = np.mean(x)
>>> x_std = np.std(x)
>>> x_c, x_m = p.coef
>>> x_intercept = (x_m * x_mean - x_c * x_std) / x_m
>>> x_intercept
150737.19742902054

Clearly, these two differ substantially. How can I replicate the x_intercept calculated in MATLAB using Python?


Solution

  • For the sake of numerical precision, it is handy to map polynomial data to the domain [-1, 1], or some similarly well-bounded region, when computing fits. A classic example is given in the numpy docs on fitting Chebyshev polynomials, which appear to lose all their interesting features if you zoom out far enough. The mapping matters little for lines, but is highly useful for higher-order polynomials, where x**n may explode. The specific window [-1, 1] is used precisely to prevent such explosions for arbitrarily shifted data and large n (since (±1)**n always has magnitude 1, and x**n for |x| ≤ 1 stays well bounded).
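    The conditioning benefit is easy to demonstrate. The snippet below (a sketch, not part of the original answer) compares the condition number of the Vandermonde matrix built from raw x-values like those in the question against one built from x mapped to a region around [-1, 1]:

```python
import numpy as np

# x-values like the ones in the question: large and far from zero
x = np.linspace(6210.0, 6288.0, 79)
deg = 5

# Vandermonde matrix on the raw data: column magnitudes span ~1 to ~6288**5
V_raw = np.vander(x, deg + 1)

# Same matrix after mapping x to a well-bounded region near [-1, 1]
V_mapped = np.vander((x - x.mean()) / x.std(), deg + 1)

print(np.linalg.cond(V_raw))     # astronomically ill-conditioned
print(np.linalg.cond(V_mapped))  # tame
```

    The least-squares solve behind a degree-5 fit is hopeless in the first basis and routine in the second, which is exactly why both libraries offer the mapping.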

    Let's start by looking at the data (which is always a good first step):

    (figure: the data points plotted with the linear fit overlaid)

    I've included a line fit here. Just by eye, I can tell that a root at ~6340 is reasonable for the line fit. It's also pretty clear that the data is actually more cubic in nature, with an "actual" root around 6290, but that's irrelevant to the question.

    MATLAB takes the approach of mapping the x-data so that the one-standard-deviation portion of the domain fits within the window [-1, 1]. That means the polynomial coefficients you get apply to (x - mean(xRight)) / std(xRight). If you choose to scale, by requesting three output arguments from polyfit, you do not have the option of choosing a different mapping:

    >> [p_scaled, s, m] = polyfit(xRight, yRight, 1)
    p_scaled =
       -1.5811    6.0095
    s = 
      struct with fields:
            R: [2×2 double]
           df: 77
        normr: 8.5045
    m =
       1.0e+03 *
        6.2490
        0.0229
    
    >> p_unscaled = polyfit(xRight, yRight, 1)
    p_unscaled =
       -0.0689  436.5347
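
    As a cross-check, MATLAB's centered-and-scaled fit can be reproduced by hand in NumPy. One caveat: MATLAB's std normalizes by N - 1 while np.std defaults to N, so ddof=1 is needed for an exact match. The helper below is my own illustration (polyfit_scaled is not a library function):

```python
import numpy as np

def polyfit_scaled(x, y, deg):
    """Mimic MATLAB's [p, S, mu] = polyfit(x, y, deg): fit against (x - mu) / sigma."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std(ddof=1)  # MATLAB's std divides by N - 1
    p = np.polyfit((x - mu) / sigma, y, deg)
    return p, (mu, sigma)

# Quick check on a synthetic line y = 2*x + 3: the scaled slope comes out
# as 2*sigma, and the scaled intercept is the value of the line at x = mu.
x = np.linspace(6210.0, 6288.0, 79)
y = 2.0 * x + 3.0
p, (mu, sigma) = polyfit_scaled(x, y, 1)
```

    Applied to the question's data, this reproduces MATLAB's p_scaled, s-free, with mu matching m.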
    

    You can compute the values of both fits at say x = 6250 using polyval:

    >> polyval(p_scaled, 6250, s, m)
    ans =
        5.9406
    
    >> polyval(p_unscaled, 6250)
    ans =
        5.9406
    
    >> polyval(p_scaled, (6250 - mean(xRight)) / std(xRight))
    ans =
        5.9406
    

    And of course manually:

    >> (6250 - m(1)) / m(2) * p_scaled(1) + p_scaled(2)
    ans =
        5.9406
    
    >> 6250 * p_unscaled(1) + p_unscaled(2)
    ans =
        5.9406
    

    Python does something similar with the domain and window arguments to np.polynomial.Polynomial.fit. Just as MATLAB maps [-std(x), std(x)] + mean(x) to [-1, 1], domain is mapped to window. The biggest difference is that you can choose both domain and window. Here are some common options:

    >>> p_nomap = np.polynomial.Polynomial.fit(xRight, yRight, 1, domain=[-1, 1]); print(p_nomap)
    436.53467443435204 - 0.06889505396981752 x**1
    >>> p_default = np.polynomial.Polynomial.fit(xRight, yRight, 1); print(p_default)
    6.009482176962028 - 2.686907104822783 x**1
    >>> p_minmax = np.polynomial.Polynomial.fit(xRight, yRight, 1, domain=[np.min(xRight), np.max(xRight)]); print(p_minmax)
    6.009482176962028 - 2.686907104822783 x**1
    >>> p_matlab = np.polynomial.Polynomial.fit(xRight, yRight, 1, domain=np.std(xRight) * np.arange(-1, 2, 2) + np.mean(xRight)); print(p_matlab)
    6.009482176962061 - 1.571048948945243 x**1
    

    You can see the following equivalences:

    • Numpy's p_matlab <=> MATLAB's p_scaled (up to std normalization: MATLAB's std divides by N - 1 while np.std defaults to N, which is why the slopes differ slightly; build the domain with np.std(xRight, ddof=1) to match exactly)
    • Numpy's p_nomap <=> MATLAB's p_unscaled
    • Numpy's p_minmax <=> Numpy's p_default
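
    The small mismatch between p_matlab's slope above (-1.5710) and MATLAB's p_scaled (-1.5811) comes entirely from the std normalization: MATLAB divides by N - 1, np.std by N. A sketch with synthetic data showing that ddof=1 lines the scaled coefficients up, and that the root is unaffected either way:

```python
import numpy as np

x = np.linspace(0.0, 9.0, 10)
y = 2.0 * x + 1.0

# Domain chosen the way MATLAB does it: mean +/- one (N-1)-normalized std
dom = np.mean(x) + np.std(x, ddof=1) * np.array([-1.0, 1.0])
p = np.polynomial.Polynomial.fit(x, y, 1, domain=dom)

# Scaled slope is slope * std(ddof=1), matching MATLAB's p_scaled(1);
# the root in original x units is the same for any domain choice.
```

    With ddof=0 instead, the scaled slope would shrink by the factor sqrt((N - 1) / N), which is exactly the -1.5710 vs -1.5811 discrepancy above.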

    You can verify with a similar test as in MATLAB, except that instead of polyval, you can just call a Polynomial object directly:

    >>> p_nomap(6250)
    5.940587122992554
    >>> p_default(6250)
    5.940587122992223
    >>> p_minmax(6250)
    5.940587122992223
    >>> p_matlab(6250)
    5.94058712299223
    

    Python has a nicer syntax in this case, and since you are using an object, it keeps track of all the scales and offsets for you internally.
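    For instance (a sketch with synthetic stand-in data), the fitted object carries its domain and window with it, and convert() re-expresses the fit in plain, unscaled x:

```python
import numpy as np

x = np.linspace(6210.0, 6288.0, 79)
y = 2.0 * x + 1.0

p = np.polynomial.Polynomial.fit(x, y, 1)  # default domain = [x.min(), x.max()]
print(p.domain, p.window)                  # the mapping travels with the object

q = p.convert()                            # coefficients in unscaled x
print(q.coef)                              # approximately [1.0, 2.0]
```

    convert() is the closest NumPy analogue of MATLAB's unscaled polyfit output: it undoes the domain-to-window mapping for you.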

    You can estimate the root of the line in any number of ways now that you understand how the different choices of domain work. In MATLAB:

    >> roots(p_unscaled)
    ans =
       6.3362e+03
    >> roots(p_scaled) * m(2) + m(1)
    ans =
       6.3362e+03
    >> -p_unscaled(2) / p_unscaled(1)
    ans =
       6.3362e+03
    >> -p_scaled(2) / p_scaled(1) * m(2) + m(1)
    ans =
       6.3362e+03
    

    You can do the same thing in numpy:

    >>> p_nomap.roots()
    array([6336.22661252])
    >>> p_default.roots()
    array([6336.22661252])
    >>> p_minmax.roots()
    array([6336.22661252])
    >>> p_matlab.roots()
    array([6336.22661252])
    

    To evaluate manually, we can use Polynomial.mapparms, which is roughly equivalent to the output parameter m in MATLAB. The major difference is that mapparms returns the offset and scale of the affine map from domain to window; for p_matlab this is [-np.mean(xRight) / np.std(xRight), 1 / np.std(xRight)], which simplifies the evaluation process.

    >>> -p_nomap.coef[0] / p_nomap.coef[1]
    6336.22661251699
    >>> (-p_default.coef[0] / p_default.coef[1] - p_default.mapparms()[0]) / p_default.mapparms()[1]
    6336.226612516987
    >>> (-p_minmax.coef[0] / p_minmax.coef[1] - p_minmax.mapparms()[0]) / p_minmax.mapparms()[1]
    6336.226612516987
    >>> (-p_matlab.coef[0] / p_matlab.coef[1] - p_matlab.mapparms()[0]) / p_matlab.mapparms()[1]
    6336.226612516987
    

    So the key to getting equivalent results is either to choose identical parameters (and understand their correspondence), or to use the provided functions to evaluate polynomials and their roots. In general I recommend the latter regardless of the former. The reason those functions are provided is that you can get accurate results from any choice of domain, as long as you pass the data along consistently (manually for polyval, automatically for Polynomial.__call__).
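
    Putting it all together, the robust way to get the question's x_intercept in Python is to let the Polynomial object handle the scaling and simply ask for the root. A sketch with stand-in data constructed to have its root at x = 6336 (the question's arrays work the same way):

```python
import numpy as np

# Stand-in data: a line with a known root at x = 6336
x = np.arange(6210.0, 6289.0)
y = -0.0689 * (x - 6336.0)

p = np.polynomial.Polynomial.fit(x, y, 1)  # any domain choice works here
root = p.roots()[0]                        # root reported in original x units
```

    No manual unmapping with mean and std is needed; roots() accounts for the domain-to-window transform internally, which is exactly where the 150737 figure in the question went wrong.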