Search code examples
pythonscipylinear-regressionquadratic-programming

constrained linear regression / quadratic programming python


I have a dataset like this:

import numpy as np

a = np.array([1.2, 2.3, 4.2])
b = np.array([1, 5, 6])
c = np.array([5.4, 6.2, 1.9])

m = np.vstack([a,b,c])
y = np.array([5.3, 0.9, 5.6])

and want to fit a constrained linear regression

y = b1*a + b2*b + b3*c

where all b's sum to one and are positive: b1+b2+b3=1

A similar problem in R is specified here:

https://stats.stackexchange.com/questions/21565/how-do-i-fit-a-constrained-regression-in-r-so-that-coefficients-total-1

How can I do this in python?


Solution

  • EDIT: These two approaches are very general and can work for small-medium scale instances. For a more efficient approach, check the answer of chthonicdaemon (using customized preprocessing and scipy's optimize.nnls).

    Using scipy

    Code

    import numpy as np
    from scipy.optimize import minimize
    
    a = np.array([1.2, 2.3, 4.2])
    b = np.array([1, 5, 6])
    c = np.array([5.4, 6.2, 1.9])
    
    m = np.vstack([a,b,c])
    y = np.array([5.3, 0.9, 5.6])
    
    def loss(x):
        return np.sum(np.square((np.dot(x, m) - y)))
    
    cons = ({'type': 'eq',
             'fun' : lambda x: np.sum(x) - 1.0})
    
    x0 = np.zeros(m.shape[0])
    res = minimize(loss, x0, method='SLSQP', constraints=cons,
                   bounds=[(0, np.inf) for i in range(m.shape[0])], options={'disp': True})
    
    print(res.x)
    print(np.dot(res.x, m.T))
    print(np.sum(np.square(np.dot(res.x, m) - y)))
    

    Output

    Optimization terminated successfully.    (Exit mode 0)
            Current function value: 18.817792344
            Iterations: 5
            Function evaluations: 26
            Gradient evaluations: 5
    [ 0.7760881  0.         0.2239119]
    [ 1.87173571  2.11955951  4.61630834]
    18.817792344
    

    Evaluation

    • It's obvious that the model-capabilitites / model-complexity is not enough to obtain a good performance (high loss!)

    Using general-purpose QP/SOCP-optimization modelled by cvxpy

    Advantages:

    • cvxpy prooves, that the problem is convex
    • convergence for convex optimization problems is guaranteed (might be also true for the above)
    • in general: better accuracy
    • in general: more robust in regards to numerical instabilities (solver is only able to solve SOCPs; not non-convex models like the SLSQP-approach above!)

    Code

    import numpy as np
    from cvxpy import *
    
    a = np.array([1.2, 2.3, 4.2])
    b = np.array([1, 5, 6])
    c = np.array([5.4, 6.2, 1.9])
    
    m = np.vstack([a,b,c])
    y = np.array([5.3, 0.9, 5.6])
    
    X = Variable(m.shape[0])
    constraints = [X >= 0, sum_entries(X) == 1.0]
    
    product = m.T * diag(X)
    diff = sum_entries(product, axis=1) - y
    problem = Problem(Minimize(norm(diff)), constraints)
    problem.solve(verbose=True)
    
    print(problem.value)
    print(X.value)
    

    Output

    ECOS 2.0.4 - (C) embotech GmbH, Zurich Switzerland, 2012-15. Web: www.embotech.com/ECOS
    
    It     pcost       dcost      gap   pres   dres    k/t    mu     step   sigma     IR    |   BT
     0  +0.000e+00  -0.000e+00  +2e+01  5e-01  1e-01  1e+00  4e+00    ---    ---    1  1  - |  -  - 
     1  +2.451e+00  +2.539e+00  +4e+00  1e-01  2e-02  2e-01  8e-01  0.8419  4e-02   2  2  2 |  0  0
     2  +4.301e+00  +4.306e+00  +2e-01  5e-03  7e-04  1e-02  4e-02  0.9619  1e-02   2  2  2 |  0  0
     3  +4.333e+00  +4.334e+00  +2e-02  4e-04  6e-05  1e-03  4e-03  0.9326  2e-02   2  1  2 |  0  0
     4  +4.338e+00  +4.338e+00  +5e-04  1e-05  2e-06  4e-05  1e-04  0.9698  1e-04   2  1  1 |  0  0
     5  +4.338e+00  +4.338e+00  +3e-05  8e-07  1e-07  3e-06  7e-06  0.9402  7e-03   2  1  1 |  0  0
     6  +4.338e+00  +4.338e+00  +7e-07  2e-08  2e-09  6e-08  2e-07  0.9796  1e-03   2  1  1 |  0  0
     7  +4.338e+00  +4.338e+00  +1e-07  3e-09  4e-10  1e-08  3e-08  0.8458  2e-02   2  1  1 |  0  0
     8  +4.338e+00  +4.338e+00  +7e-09  2e-10  2e-11  9e-10  2e-09  0.9839  5e-02   1  1  1 |  0  0
    
    OPTIMAL (within feastol=1.7e-10, reltol=1.5e-09, abstol=6.5e-09).
    Runtime: 0.000555 seconds.
    
    4.337947939  # needs to be squared to be compared to scipy's output!
                 #  as we are using l2-norm (outer sqrt) instead of sum-of-squares
                 #  which is nicely converted to SOCP-form and easier to
                 #  tackle by SOCP-based solvers like ECOS
                 #  -> does not change the solution-vector x, only the obj-value
    [[  7.76094262e-01]
     [  7.39698388e-10]
     [  2.23905737e-01]]