python scikit-learn linear-regression sklearn-pandas

Using sklearn linear regression, how can I constrain the calculated regression coefficients to be greater than 0?


I'm using the sklearn reference at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html but I can't find an option to constrain the regression coefficients.

Does anyone know of another Python package that performs multiple-variable linear regression and can constrain the regression coefficients to be greater than 0?

Here is the code I have so far.

'''data:
date        A            B              C
10/30/2015  0.063363323 -0.005218807    0.079777558
11/30/2015  -0.013171244    -0.008727599    0.010352028
12/31/2015  -0.017551268    8.09E-05    -0.020491923
1/29/2016   -0.042606469    0.052272139 -0.080362246
2/29/2016   -0.015224562    0.031250961 0.029988488
3/31/2016   0.058291876 -0.000238614    0.056727336
4/29/2016   0.000505675 -0.005325338    0.02854057
5/31/2016   0.012766515 0.008548162 -0.001631845
6/30/2016   -0.038981203    0.064236963 0.00570145
7/29/2016   0.033715429 0.024269606 0.02703294
8/31/2016   -0.002083837    -0.009439625    0.004129397
9/30/2016   -0.009825674    -0.01737909 -0.019251885
11/30/2016  0.0084733   -0.11668582 0.031928726
12/30/2016  0.017084282 -0.005553088    0.029372131
1/31/2017   0.014263947 0.004036504 0.00187079
2/28/2017   0.037375566 0.016081105 0.039331615
3/31/2017   -0.002494984    -0.005942793    -0.002097504
4/28/2017   -0.005054922    0.015685226 0.008243977
5/31/2017   0.002285393 0.020771375 0.002697755
6/30/2017   0.002841457 0.004886117 0.019202011
7/31/2017   0.014866638 -0.006900926    0.010126577
8/31/2017   -0.016647997    0.035687133 -0.008709075
9/29/2017   0.019523651 -0.022154361    0.020468398
10/31/2017  0.019407629 -0.000705663    0.016574416
11/30/2017  0.027486425 0.008008173 0.033427299
12/29/2017  0.007861222 0.018095096 0.017908809
1/31/2018   0.058702838 -0.032765285    0.05
'''

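from sklearn import linear_model

# df is a pandas DataFrame holding the data above (columns A, B, C)
reg = linear_model.LinearRegression(fit_intercept=False)
reg.fit(df[['B', 'C']], df['A'])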

print(reg.coef_)

# [ 0.67761268 -0.08845756]

Working code below

import numpy as np
from scipy.optimize import lsq_linear

lb = 0
ub = np.inf  # np.Inf was removed in NumPy 2.0
res = lsq_linear(df[['B', 'C']], 
                 df['A'], 
                 bounds=(lb, ub))

print(res.x)
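
As an alternative to the lsq_linear call above: scikit-learn 0.24 and later accept a positive=True argument on LinearRegression, which constrains the coefficients to be non-negative (dense input only). The sketch below is a minimal illustration on synthetic data. The arrays X and y are hypothetical stand-ins for df[['B', 'C']] and df['A'], not the data from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: the true second coefficient is negative,
# so an unconstrained fit would return a negative value for it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([0.7, -0.1]) + 0.01 * rng.normal(size=200)

# positive=True (scikit-learn >= 0.24) forces every coefficient to be >= 0.
reg = LinearRegression(fit_intercept=False, positive=True)
reg.fit(X, y)

print(reg.coef_)
```

With the constraint active, the coefficient that would otherwise be negative is pinned at zero and the other one is refit accordingly.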

Solution

  • In the version quoted below, sklearn's LinearRegression just wraps scipy's lstsq (or lsqr for sparse input), and neither supports bound constraints.

    You can easily modify sklearn's code though:

        if sp.issparse(X):
            if y.ndim < 2:
                out = sparse_lsqr(X, y)
                self.coef_ = out[0]
                self._residues = out[3]
            else:
                # sparse_lstsq cannot handle y with shape (M, K)
                outs = Parallel(n_jobs=n_jobs_)(
                    delayed(sparse_lsqr)(X, y[:, j].ravel())
                    for j in range(y.shape[1]))
                self.coef_ = np.vstack(out[0] for out in outs)
                self._residues = np.vstack(out[3] for out in outs)
        else:
            self.coef_, self._residues, self.rank_, self.singular_ = \
                linalg.lstsq(X, y)
            self.coef_ = self.coef_.T
    

    Just replace lstsq / lsqr with scipy's nnls (dense input only) or with lsq_linear and manually set bounds. For large-scale problems, optimize.minimize with method L-BFGS-B is another candidate, although you need to supply the gradient yourself. There are at least two common ways to compute it, e.g. from a pre-computed A.T*A, which loses sparsity.

    Remark: these methods minimize different functions (norm vs. squared norm; 0.5 factor vs. 1.0 factor). This does not change the solution vector they find, but the reported objective values differ, so take care when comparing them (if needed).
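
The three candidates named above can be compared side by side. This is a rough sketch on synthetic data, not a patch to sklearn. The arrays A and b and the helper objective are hypothetical names introduced for illustration. The gradient is computed as A.T @ (A @ x - b), which avoids forming A.T*A explicitly.

```python
import numpy as np
from scipy.optimize import lsq_linear, minimize, nnls

# Hypothetical synthetic data standing in for the design matrix and target.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 2))
b = A @ np.array([0.7, -0.1]) + 0.01 * rng.normal(size=200)

# 1) nnls: dense only; returns the solution and the residual norm ||Ax - b||.
x_nnls, rnorm = nnls(A, b)

# 2) lsq_linear with bounds [0, inf); res.cost is 0.5 * ||Ax - b||^2.
res = lsq_linear(A, b, bounds=(0, np.inf))

# 3) L-BFGS-B on 0.5 * ||Ax - b||^2, returning value and gradient together.
def objective(x):
    r = A @ x - b
    return 0.5 * r @ r, A.T @ r

opt = minimize(objective, x0=np.zeros(A.shape[1]), jac=True,
               method="L-BFGS-B", bounds=[(0, None)] * A.shape[1])

print(x_nnls, res.x, opt.x)              # solution vectors
print(rnorm, res.cost, 0.5 * rnorm**2)   # objectives on different scales
```

The solution vectors should agree up to solver tolerance, while nnls reports a plain norm and lsq_linear reports half the squared norm, which is the difference the remark points out.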