python pandas scikit-learn regression lasso-regression

Lasso Regression with Python: Simple Question


Assume I have a table of values:

import pandas as pd

df = pd.DataFrame({'Y1':[1, 2, 3, 4, 5, 6], 'X1':[1, 2, 3, 4, 5, 6], 'X2':[1, 1, 2, 1, 1, 1], 
              'X3':[6, 6, 6, 5, 6, 4], 'X4':[6, 5, 4, 3, 2, 1]})

I want to fit a simple Lasso regression using all of these values as my training set, where Y1 is the dependent variable and X1...X4 are the independent variables. I've tried using the following:

from sklearn.linear_model import Lasso
Lasso(alpha = 0.0001).fit(df, df['Y1'])

but it doesn't give me the coefficients that I want. How do I go about performing this simple task? Thanks.


Solution

  • I don't think you fully understand what the coefficients mean. First of all, you should not be regressing 'Y1' on all of your variables (with 'Y1' included). Don't include 'Y1' in your independent variables:

    Lasso(alpha = 0.0001).fit(df[['X1','X2','X3','X4']], df['Y1'])
    

    Lasso is linear regression with an L1 penalty on the coefficients: it shrinks them toward zero and sets some of them exactly to zero, which in effect picks a subset of independent variables that predict your dependent variable well. What you need to understand is what linear regression is doing. Remember that the objective of a linear regression is to create a linear model that can be used to predict values of your dependent variable. You might propose the following model, which is what you are trying to solve for when doing linear regression; specifically, you are solving for the coefficients (scikit-learn also fits an intercept by default, which is left out here for simplicity):

    Y1 = b1*X1 + b2*X2 + b3*X3 + b4*X4
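
    To make "solving for the coefficients" concrete, here is a minimal sketch of the quantity Lasso minimizes, following the objective given in the scikit-learn documentation: the mean squared error divided by 2, plus alpha times the sum of the absolute coefficients. The function name `lasso_objective` is made up for illustration, and the intercept is left out to match the model above.

```python
# Lasso objective without an intercept:
# (1 / (2 * n)) * sum(residual^2) + alpha * sum(|b|)
def lasso_objective(rows, y, coefs, alpha):
    n = len(y)
    squared_error = 0.0
    for features, target in zip(rows, y):
        prediction = sum(b * x for b, x in zip(coefs, features))
        squared_error += (target - prediction) ** 2
    l1_penalty = sum(abs(b) for b in coefs)
    return squared_error / (2 * n) + alpha * l1_penalty

# Each row is (X1, X2, X3, X4) from the table in the question.
rows = [(1, 1, 6, 6), (2, 1, 6, 5), (3, 2, 6, 4),
        (4, 1, 5, 3), (5, 1, 6, 2), (6, 1, 4, 1)]
y = [1, 2, 3, 4, 5, 6]

all_zero = lasso_objective(rows, y, [0, 0, 0, 0], alpha=0.0001)
only_x1 = lasso_objective(rows, y, [1, 0, 0, 0], alpha=0.0001)
print(all_zero, only_x1)
```

    With all coefficients at zero the error term dominates; with b1 = 1 and the rest at zero the error vanishes and only the small penalty of alpha * 1 remains.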

    Now if we use the coefficients you suggested (leaving 'Y1' in), then the model would be:

    Y1 = Y1 + X1 - X4

    But you can obviously see that this does not predict 'Y1' very well. We can alter the model to be just:

    Y1 = Y1

    'Y1' predicts 'Y1' perfectly (duh). This is why your output for the coefficients is [ 1, 0, -0, -0, -0]. But this is not what we want when running a regression. Like I said before, you want to leave 'Y1' out of the regression. Thus, using the coefficients you suggested and leaving out 'Y1', your model would be:

    Y1 = X1 - X4

    Notice again that this does not predict 'Y1' very well (you can test out some points from your dataset). Instead, you could use the following model to predict 'Y1' perfectly:

    Y1 = X1
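
    You can check both candidate models by hand on the question's data. The sketch below uses plain Python lists (the variable names are only for illustration) and compares the predictions of Y1 = X1 - X4 and Y1 = X1 against the actual 'Y1' values:

```python
# Columns copied from the question's table.
x1 = [1, 2, 3, 4, 5, 6]
x4 = [6, 5, 4, 3, 2, 1]
y1 = [1, 2, 3, 4, 5, 6]

# Predictions of the two candidate models.
pred_x1_minus_x4 = [a - b for a, b in zip(x1, x4)]
pred_x1_only = list(x1)

# Errors (actual minus predicted) for each model.
errors_x1_minus_x4 = [t - p for t, p in zip(y1, pred_x1_minus_x4)]
errors_x1_only = [t - p for t, p in zip(y1, pred_x1_only)]

print(pred_x1_minus_x4)    # [-5, -3, -1, 1, 3, 5]
print(errors_x1_minus_x4)  # [6, 5, 4, 3, 2, 1]
print(errors_x1_only)      # [0, 0, 0, 0, 0, 0]
```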

    Thus if you lasso regress 'Y1' on 'X1','X2','X3','X4' you should get coefficients of approximately [1, 0, 0, 0] (the small alpha shrinks the first coefficient very slightly below 1).
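
    Putting it together, here is a small self-contained sketch that runs both fits on the question's data so you can compare them side by side. The names `with_target` and `without_target` are only illustrative; `coef_` and `intercept_` are the fitted attributes of scikit-learn's `Lasso`. The exact printed digits may vary slightly, so no output is shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

df = pd.DataFrame({'Y1': [1, 2, 3, 4, 5, 6], 'X1': [1, 2, 3, 4, 5, 6],
                   'X2': [1, 1, 2, 1, 1, 1], 'X3': [6, 6, 6, 5, 6, 4],
                   'X4': [6, 5, 4, 3, 2, 1]})
features = ['X1', 'X2', 'X3', 'X4']

# The question's attempt: 'Y1' is also among the predictors.
with_target = Lasso(alpha=0.0001).fit(df, df['Y1'])

# The corrected fit: only X1..X4 are predictors.
without_target = Lasso(alpha=0.0001).fit(df[features], df['Y1'])

# Coefficients in column order; the first fit has five, the second four.
print(np.round(with_target.coef_, 3))
print(np.round(without_target.coef_, 3))
print(round(float(without_target.intercept_), 3))
```

    In the first fit the weight lands on 'Y1' itself; in the second it lands on 'X1', with an intercept close to zero.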