Tags: python, linear-regression, statsmodels

Simple linear regression with constraint


I have developed an algorithm that loops through 15 variables and fits a simple OLS regression for each one. The algorithm then loops a further 11 times, producing the same 15 OLS regressions but increasing the lag of the X variable by one on each pass. I select the independent variables with the highest R² and use the optimal lag for 3, 4, or 5 of them,

i.e.

Y_{t+1} - Y_t = B (X_{t+k} - X_t) + e

My dataset looks like this:

import numpy as np
import pandas as pd

Regression = pd.DataFrame(np.random.randint(low=0, high=10, size=(100, 6)),
                          columns=['Y', 'X1', 'X2', 'X3', 'X4', 'X5'])
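
With that frame in hand, here is a minimal sketch of the lag search described above (hypothetical: the differencing, alignment, and R²-based scoring are my own reading of the procedure, not code from the question):

import statsmodels.api as sm

# Hypothetical sketch of the lag search: for each candidate X column and
# each lag k, regress the one-step change in Y on the k-step change in X
# and record the R-squared of the fit.
scores = []
for col in ['X1', 'X2', 'X3', 'X4', 'X5']:        # 15 variables in the real data
    for k in range(1, 12):                        # lags 1 through 11
        dY = Regression['Y'].diff().dropna()      # Y_{t+1} - Y_t
        dX = Regression[col].diff(k).dropna()     # X_{t+k} - X_t (up to alignment)
        idx = dY.index.intersection(dX.index)
        fit = sm.OLS(dY.loc[idx], dX.loc[idx]).fit()
        scores.append((col, k, fit.rsquared))

# Best lag per variable, then the top 3-5 variables by R-squared
ranked = (pd.DataFrame(scores, columns=['var', 'lag', 'r2'])
            .sort_values('r2', ascending=False)
            .drop_duplicates('var'))
print(ranked.head(5))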

The OLS regression I have fitted so far uses the following code:

import statsmodels.api as sm

Y = Regression['Y']
X = Regression[['X1', 'X2', 'X3']]

# Fit OLS without an intercept (no constant term is added)
Model = sm.OLS(Y, X).fit()
predictions = Model.predict(X)

Model.summary()
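
The fitted coefficients can also be read straight off the results object:

print(Model.params)   # pandas Series: one estimate per column of X (can be negative)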

The issue is that OLS can produce negative coefficients (and mine does). I'd appreciate help constraining the model so that:

sum(B_i) = 1

B_i >= 0

Solution

  • This works nicely:

    import numpy as np
    from scipy.optimize import minimize
    
    # Define the model: a linear combination of the three regressors
    model = lambda b, X: b[0] * X[:, 0] + b[1] * X[:, 1] + b[2] * X[:, 2]
    
    # The objective function to minimize (least-squares regression)
    obj = lambda b, Y, X: np.sum(np.abs(Y - model(b, X))**2)
    
    # Bounds: b[0], b[1], b[2] >= 0
    bnds = [(0, None), (0, None), (0, None)]
    
    # Constraint: b[0] + b[1] + b[2] - 1 = 0
    cons = [{"type": "eq", "fun": lambda b: b[0] + b[1] + b[2] - 1}]
    
    # Initial guess for b[0], b[1], b[2], chosen to satisfy the constraints
    xinit = np.array([0.0, 0.0, 1.0])
    
    # Pass NumPy arrays, since the model indexes columns positionally;
    # with an equality constraint present, minimize dispatches to SLSQP
    res = minimize(obj, args=(Y.values, X.values), x0=xinit,
                   bounds=bnds, constraints=cons)
    
    print(f"b1={res.x[0]}, b2={res.x[1]}, b3={res.x[2]}")
    
    # Save the coefficients for further analysis on goodness of fit
    beta1 = res.x[0]
    beta2 = res.x[1]
    beta3 = res.x[2]
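
    For the goodness-of-fit analysis mentioned in the comment above, one option is to compute R² by hand from the saved coefficients; a minimal sketch (note this uses the centered total sum of squares, which differs from the uncentered R² that statsmodels reports for a no-intercept model):

    # R-squared of the constrained fit, computed from the saved coefficients
    fitted = beta1 * X['X1'] + beta2 * X['X2'] + beta3 * X['X3']
    ss_res = np.sum((Y - fitted)**2)       # residual sum of squares
    ss_tot = np.sum((Y - Y.mean())**2)     # total (centered) sum of squares
    r_squared = 1 - ss_res / ss_tot
    print(f"R^2 = {r_squared:.4f}")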