Tags: python, regression, robust

Huber Regressor returns inconsistent sign of coefficient


I've been trying to fit a Huber regressor (sklearn.linear_model.HuberRegressor) on a time series, and I came across a strange phenomenon: sometimes it returns a very small negative coefficient and sometimes a positive one, even though the series only ever moves upward.

For example:

import numpy as np
import pandas as pd
from sklearn.linear_model import HuberRegressor

HuberRegressor(epsilon=1.35,max_iter=10000).fit(X=np.arange(12).reshape(-1,1),y=pd.Series([0,0,0,0,0,0,0,0,0,0,0,1])).coef_

array([3.1380359e-08])

HuberRegressor(epsilon=1.35,max_iter=10000).fit(X=np.arange(12).reshape(-1,1),y=pd.Series([0,0,0,0,0,0,0,0,0,0,0,2])).coef_

array([-2.20536164e-10])

HuberRegressor(epsilon=1.35,max_iter=10000000).fit(X=np.arange(12).reshape(-1,1),y=pd.Series([0,0,0,0,0,0,0,0,0,0,0,5])).coef_

array([7.63157014e-07])

HuberRegressor(epsilon=1.35,max_iter=10000).fit(X=np.arange(12).reshape(-1,1),y=pd.Series([0,0,0,0,0,0,0,0,0,0,0,248])).coef_

array([-4.49809127e-07])

So all I did was raise the last observation (1, 2, 5, 248), and yet the sign of the coefficient changes. As this is Huber regression, 1, 2, 5, 248 should all be identified as outliers since all the other values are zeros. Should I specify something different in the model?


Solution

  • TL;DR

    When adapting your example to reproduce the issue, I got the feeling that it is due to floating-point arithmetic error, because you are estimating a null slope with a numerical algorithm that works in float arithmetic.

    Floating-Point Arithmetic Error

    It may seem strange to get the wrong sign, and it would be if the computations were performed with real numbers, but this is a common phenomenon with floating-point numbers.
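
    As a quick self-contained illustration (my own, independent of HuberRegressor): quantities that are exactly zero in real arithmetic typically leave tiny residues of essentially arbitrary sign in float arithmetic, depending on rounding and evaluation order:

    # Both expressions are exactly zero in real arithmetic; in IEEE-754
    # doubles they leave tiny residues whose signs are accidents of rounding.
    print(sum([0.1] * 10) - 1.0)  # -1.1102230246251565e-16 (negative)
    print(0.1 + 0.2 - 0.3)        #  5.551115123125783e-17 (positive)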

    Let's adapt your MCVE to:

    import numpy as np
    from sklearn import linear_model

    u = 1
    c = []
    x = np.arange(12).reshape(-1, 1)
    for k in [0, 1, 2, 5, 25, 100, 200, 500, 1000]:
        # Eleven observations equal to u, followed by a single outlier k:
        y = np.array([u] * (len(x) - 1) + [k])
        m = linear_model.HuberRegressor(tol=1e-16).fit(X=x, y=y)
        c.append(m.coef_[0])

    c  # collected slope coefficients, one per value of k
    
    [-5.923749784709837e-09,
     -4.9322755264475916e-11,
     2.5190368660836694e-10,
     8.3699110105873e-07,
     -4.3671163925160265e-10,
     1.2964166828133428e-08,
     1.5063190859596705e-06,
     1.5063100994140354e-06,
     1.5063152932632047e-06]
    

    I admit that most of the numbers returned in this case are arguably too big to be considered zero, but I still suspect this is the cause, because changing the tol switch does affect the result (see the sketch below) and switching to a non-null slope does return the expected result.
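
    To see the effect of tol concretely, here is a small sketch of my own (exact values will differ across platforms and scikit-learn versions, so take the printed numbers as indicative only); the near-zero coefficient wanders on both sides of zero as tol changes:

    import numpy as np
    from sklearn.linear_model import HuberRegressor

    x = np.arange(12).reshape(-1, 1)
    y = np.array([0] * 11 + [2])
    for tol in [1e-2, 1e-5, 1e-8, 1e-16]:
        m = HuberRegressor(tol=tol, max_iter=10000).fit(x, y)
        print(tol, m.coef_[0])  # tiny values of either sign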

    Working Use Case

    If we change the data so that we are assessing a unitary slope instead, we get the following results:

    c = []
    x = np.arange(12).reshape(-1, 1)
    for k in [0, 1, 2, 5, 25, 100, 200, 500, 1000]:
        # Eleven observations lying exactly on the unit slope, then an outlier k:
        y = np.append(x[:-1].ravel(), k)
        m = linear_model.HuberRegressor(tol=1e-16).fit(X=x, y=y)
        c.append(m.coef_[0])

    c  # collected slope coefficients, one per value of k
    
    [0.9999999423436053,
     0.9999991468916037,
     0.9999999916835522,
     0.9999999916837012,
     1.0000000059171992,
     1.000000006233575,
     1.000000006237059,
     1.000002303772441,
     1.0000023037721706]
    

    Now the coefficient fluctuates slightly below or above unity, which is the expected result.

    My intuition is: the test you performed gives strange results because it sits at the boundary where machine precision (the acceptable value for zero) and the algorithm precision have the same magnitude, hence the erratic sign. When the value of the coefficient becomes significantly larger than the algorithm precision, the phenomenon disappears.
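
    Practically, if all you need is the direction of the trend, one workaround (my suggestion, with a hypothetical helper name and a tolerance you must choose for your data) is to clamp near-zero coefficients to zero instead of reading their raw sign:

    import numpy as np

    def effective_sign(coef, atol=1e-6):
        """Return 0 when |coef| is indistinguishable from zero at tolerance atol."""
        return 0 if abs(coef) < atol else int(np.sign(coef))

    # effective_sign(-4.49809127e-07) -> 0, rather than a spurious downward trend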

    This is a common issue with float arithmetic, and it requires developers to design stable and accurate algorithms, which is a complex task. A good reference covering many aspects of it is:

    N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002. ISBN 0-89871-521-0

    My advice: you may open an issue on scikit-learn to highlight this observation; they may add a new test to their unit-test suite to cover this specific use case, and they can provide more insight into what is going on under the hood.