Tags: python, numpy, regression, curve-fitting, smoothing

Confusing result with quadratic regression


So, I'm trying to fit pairs of x, y data with a quadratic regression; a sample formula can be found at http://polynomialregression.drque.net/math.html. Below is my code, which does the regression both with that explicit formula and with NumPy's built-in functions:

import numpy as np 
x = [6.230825,6.248279,6.265732]
y = [0.312949,0.309886,0.306639472]
toCheck = x[2]


def evaluateValue(coeff, x):
    c, b, a = coeff
    val = np.around(a + b*x + c*x**2, 9)
    act = 0.306639472
    error = np.abs(act - val) * 100 / act
    print("Value = {:.9f} Error = {:.2f}%".format(val, error))



###### Using numpy ######################
coeff = np.polyfit(x,y,2)
evaluateValue(coeff, toCheck)



################# Using explicit formula
def determinant(a,b,c,d,e,f,g,h,i):
    # the matrix is [[a,b,c],[d,e,f],[g,h,i]]
    return a*(e*i - f*h) - b*(d*i - g*f) + c*(d*h - e*g)

a = b = c = d = e = m = n = p = 0
a = len(x)
for i,j in zip(x,y):
        b += i
        c += i**2
        d += i**3
        e += i**4
        m += j
        n += j*i
        p += j*i**2
det = determinant(a,b,c,b,c,d,c,d,e)
c0 = determinant(m,b,c,n,c,d,p,d,e)/det
c1 = determinant(a,m,c,b,n,d,c,p,e)/det
c2 = determinant(a,b,m,b,c,n,c,d,p)/det

evaluateValue([c2,c1,c0], toCheck)




######Using another explicit alternative
def determinantAlt(a,b,c,d,e,f,g,h,i):
    return a*e*i - a*f*h - b*d*i + b*g*f + c*d*h - c*e*g # <- brackets removed

a = b = c = d = e = m = n = p = 0
a = len(x)
for i,j in zip(x,y):
        b += i
        c += i**2
        d += i**3
        e += i**4
        m += j
        n += j*i
        p += j*i**2
det = determinantAlt(a,b,c,b,c,d,c,d,e)
c0 = determinantAlt(m,b,c,n,c,d,p,d,e)/det
c1 = determinantAlt(a,m,c,b,n,d,c,p,e)/det
c2 = determinantAlt(a,b,m,b,c,n,c,d,p)/det

evaluateValue([c2,c1,c0], toCheck)

This code gives the following output:

Value = 0.306639472 Error = 0.00%
Value = 0.308333580 Error = 0.55%
Value = 0.585786477 Error = 91.03%

As you can see, these results differ from each other, and the third one is totally wrong. Now my questions are:
1. Why does the explicit formula give a slightly wrong result, and how can I improve it?
2. How does numpy give such an accurate result?
3. In the third case, how can merely expanding the parentheses change the result so drastically?


Solution

  • So there are a few things going on here that are unfortunately plaguing the way you are doing things. Take a look at this code:

    for i,j in zip(x,y):
            b += i
            c += i**2
            d += i**3
            e += i**4
            m += j
            n += j*i
            p += j*i**2
    

    You are building features such that the x values are not only squared, but also cubed and raised to the fourth power.

    If you print out each of these values before you put them into the 3 x 3 matrix to solve:

    In [35]: a = b = c = d = e = m = n = p = 0
        ...: a = len(x)
        ...: for i,j in zip(x,y):
        ...:         b += i
        ...:         c += i**2
        ...:         d += i**3
        ...:         e += i**4
        ...:         m += j
        ...:         n += j*i
        ...:         p += j*i**2
        ...: print(a, b, c, d, e, m, n, p)
        ...:
        ...:
    3 18.744836 117.12356813829001 731.8283056811686 4572.738547313946 0.9294744720000001 5.807505391292503 36.28641270376207
    
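To make the imbalance concrete, here is a short sketch (my own illustration, not part of the original question) that compares the condition number of this moment matrix against a mean-centered version of it; `np.linalg.cond` measures how much a matrix amplifies relative error when you solve a system with it:

```python
import numpy as np

x = [6.230825, 6.248279, 6.265732]

def moment_matrix(xs):
    # Entries are the sums of powers of x used in the normal equations
    a = len(xs)
    b = sum(xs)
    c = sum(xi**2 for xi in xs)
    d = sum(xi**3 for xi in xs)
    e = sum(xi**4 for xi in xs)
    return np.array([[a, b, c], [b, c, d], [c, d, e]])

M = moment_matrix(x)
xc = [xi - sum(x) / len(x) for xi in x]  # mean-centered copy of x
Mc = moment_matrix(xc)

print(np.linalg.cond(M))   # enormous: the raw matrix is nearly singular
print(np.linalg.cond(Mc))  # orders of magnitude smaller after centering
```

The raw matrix mixes entries of size 3 with entries of size ~4572 and is nearly rank-one because the three x values are so close together; centering removes that near-degeneracy.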

    When dealing with floating-point arithmetic, and especially with small values, the order of operations matters. What's happening here is that, by fluke, the mix of small and large values in these sums produces a determinant that is very small. When you then compute that determinant using the factored form and the expanded form, notice that you get slightly different results; also look at the magnitude of the values:

    In [36]: det = determinant(a,b,c,b,c,d,c,d,e)
    
    In [37]: det
    Out[37]: 1.0913403514223319e-10
    
    In [38]: det = determinantAlt(a,b,c,b,c,d,c,d,e)
    
    In [39]: det
    Out[39]: 2.3283064365386963e-10
    

    The determinant is on the order of 10^-10! The discrepancy arises because, while the two determinant expressions are algebraically identical and should theoretically yield the same result, in practice floating-point arithmetic gives slightly different answers for each. This is due to something called error propagation: a floating-point number has only a finite number of bits, so the order of operations changes how rounding error accumulates. Even though you only removed the parentheses and the formulas still match on paper, the sequence of operations used to reach the result is now different. This article is essential reading for any software developer who deals with floating-point arithmetic regularly: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
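A tiny sketch of this effect (my own illustration): merely regrouping three additions changes the rounded result, and subtracting nearly equal numbers can destroy information entirely:

```python
# Floating-point addition is not associative: regrouping the same
# three terms changes which rounding errors occur.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# Catastrophic cancellation: the 1.0 is smaller than the spacing
# between representable doubles near 1e16, so it vanishes entirely.
print((1e16 + 1.0) - 1e16)  # 0.0, not 1.0
```

The two determinant expansions suffer exactly this kind of cancellation: each sums products of size ~10^5 that cancel down to ~10^-10.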

    Therefore, when you solve the system with Cramer's Rule, you inevitably divide by this main determinant. Although the difference between the two computed determinants is only on the order of 10^-10, dividing by such a tiny (and slightly wrong) number massively amplifies that difference, which is why the two methods give such different coefficients.

    The reason why NumPy doesn't have this problem is that it solves the system with least-squares and the pseudo-inverse rather than Cramer's Rule. I would not recommend using Cramer's Rule to find regression coefficients, both from experience and because there are more numerically robust ways of doing it.
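As a rough sketch of what `polyfit` does under the hood (the actual implementation also scales the columns before solving, so treat this as an approximation): build the Vandermonde matrix and hand it to an SVD-based least-squares solver instead of using Cramer's Rule:

```python
import numpy as np

x = np.array([6.230825, 6.248279, 6.265732])
y = np.array([0.312949, 0.309886, 0.306639472])

# Vandermonde matrix: columns are [x**2, x, 1]
A = np.vander(x, 3)

# Solve min ||A @ coef - y||_2 via SVD-based least squares
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

print(coef)  # highest-degree coefficient first, like np.polyfit
```

The SVD handles the near-singular matrix gracefully: error enters in proportion to the condition number rather than being amplified by division by a near-zero determinant.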

    However, to solve your particular problem, it helps to normalize the data so that its dynamic range is centered at 0. The features you use to construct your coefficient matrix are then more sensible, and the computational process has an easier time dealing with the data. In your case, something as simple as subtracting the mean of the x values from each data point should work. Consequently, if you have new data points you want to predict, you must subtract the mean of the x data before doing the prediction.

    Therefore, at the beginning of your code, perform mean subtraction and regress on this data. I've marked where I've modified your code from the source above:

    import numpy as np 
    x = [6.230825,6.248279,6.265732]
    y = [0.312949,0.309886,0.306639472]
    
    # Calculate mean
    me = sum(x) / len(x)
    # Make new dataset that is mean subtracted
    xx = [pt - me for pt in x]
    
    #toCheck = x[2]
    
    # Data point to check is now mean subtracted
    toCheck = x[2] - me
    
    
    
    def evaluateValue(coeff,x):
        c,b,a = coeff
        val = np.around( a+b*x+c*x**2,9)
        act = 0.306639472
        error=  np.abs(act-val)*100/act
        print("Value = {:.9f} Error = {:.2f}%".format(val,error))
    
    
    
    ###### Using numpy ######################
    coeff = np.polyfit(xx,y,2) # Change
    evaluateValue(coeff, toCheck)
    
    
    
    ################# Using explicit formula
    def determinant(a,b,c,d,e,f,g,h,i):
        # the matrix is [[a,b,c],[d,e,f],[g,h,i]]
        return a*(e*i - f*h) - b*(d*i - g*f) + c*(d*h - e*g)
    
    a = b = c = d = e = m = n = p = 0
    a = len(x)
    for i,j in zip(xx,y): # Change
            b += i
            c += i**2
            d += i**3
            e += i**4
            m += j
            n += j*i
            p += j*i**2
    det = determinant(a,b,c,b,c,d,c,d,e)
    c0 = determinant(m,b,c,n,c,d,p,d,e)/det
    c1 = determinant(a,m,c,b,n,d,c,p,e)/det
    c2 = determinant(a,b,m,b,c,n,c,d,p)/det
    
    evaluateValue([c2,c1,c0], toCheck)
    
    
    
    
    ######Using another explicit alternative
    def determinantAlt(a,b,c,d,e,f,g,h,i):
        return a*e*i - a*f*h - b*d*i + b*g*f + c*d*h - c*e*g # <- brackets removed
    
    a = b = c = d = e = m = n = p = 0
    a = len(x)
    for i,j in zip(xx,y): # Change
            b += i
            c += i**2
            d += i**3
            e += i**4
            m += j
            n += j*i
            p += j*i**2
    det = determinantAlt(a,b,c,b,c,d,c,d,e)
    c0 = determinantAlt(m,b,c,n,c,d,p,d,e)/det
    c1 = determinantAlt(a,m,c,b,n,d,c,p,e)/det
    c2 = determinantAlt(a,b,m,b,c,n,c,d,p)/det
    evaluateValue([c2,c1,c0], toCheck)
    

    When I run this, we now get:

    In [41]: run interp_test
    Value = 0.306639472 Error = 0.00%
    Value = 0.306639472 Error = 0.00%
    Value = 0.306639472 Error = 0.00%
    

    As some final reading, here is a similar problem someone else encountered, which I addressed in their question: Fitting a quadratic function in python without numpy polyfit. In summary, I advised them not to use Cramer's Rule, to use least-squares through the pseudo-inverse instead, and showed how to get exactly the same results without using numpy.polyfit. Least-squares also generalizes: if you have more than 3 points, you can still fit a quadratic through your points so that the model has the smallest error possible.
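For completeness, here is a small sketch (my own, following the same mean-centering advice as above) of the pseudo-inverse approach applied to this question's data:

```python
import numpy as np

x = np.array([6.230825, 6.248279, 6.265732])
y = np.array([0.312949, 0.309886, 0.306639472])

# Mean-center so the Vandermonde matrix is well conditioned
me = x.mean()
A = np.vander(x - me, 3)

# Least-squares solution via the Moore-Penrose pseudo-inverse
coef = np.linalg.pinv(A) @ y

# Remember to mean-center any query point too
pred = np.polyval(coef, x[2] - me)
print(pred)  # ~0.306639472
```

With more than 3 points, the same two lines (`pinv` and the matrix product) give the minimum-squared-error quadratic fit with no change to the code.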