Tags: linear-regression, feature-selection

Feature Selection in Multivariate Linear Regression


import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate a synthetic regression dataset: X is (300, 5), y is the target
X, y = make_regression(n_samples=300, n_features=5, noise=5)
df1 = pd.DataFrame(X, columns=["X1", "X2", "X3", "X4", "X5"])
df1["Target"] = y

# Plot the pairwise correlation matrix, including the target
sns.heatmap(df1.corr(), annot=True);

Correlation Matrix

Now I can ask my question. How can I choose features that will be included in the model?


Solution

  • I am not that well-versed in Python, as I use R most of the time, but it should be something like this:

    # Create a model
    model = LinearRegression()
    # Call the .fit method and pass in your data
    model.fit(Variables, Target)
    # Or simply do
    model = LinearRegression().fit(Variables, Target)
    # So based on the dataset head provided, it should be
    X = df1[['X1', 'X2', 'X3', 'X4', 'X5']]
    Y = df1['Target']
    model = LinearRegression().fit(X, Y)
    

    To do feature selection, fit the model first, then check the p-values. Typically, a p-value of 5% (.05) or less is a good cut-off point: if a variable's p-value crosses the upper threshold of .05, the variable is insignificant and you can remove it from your model. Note that scikit-learn's `LinearRegression` does not report p-values, so you will have to compute them yourself (e.g. with statsmodels) and prune variables manually. You can also look at the correlation matrix to see which features have little correlation with the target. scikit-learn does ship some automated helpers (e.g. `sklearn.feature_selection.RFE`), but in the end, statistics are just numbers; it is up to humans to interpret the results.
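
    The p-value and correlation checks described above can be sketched as follows. This is a minimal illustration, not the asker's exact data: it rebuilds a dataset like the one in the question (with an assumed `random_state` for reproducibility) and uses statsmodels, which is not imported in the original snippet, to get per-coefficient p-values.

    ```python
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.datasets import make_regression

    # Rebuild a dataset like the one in the question (assumed setup)
    X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=0)
    df1 = pd.DataFrame(X, columns=["X1", "X2", "X3", "X4", "X5"])
    df1["Target"] = y

    # statsmodels OLS reports a p-value per coefficient;
    # add a constant column so the model has an intercept
    X_const = sm.add_constant(df1[["X1", "X2", "X3", "X4", "X5"]])
    ols = sm.OLS(df1["Target"], X_const).fit()

    # Keep only the features whose p-value is below the .05 cut-off
    pvalues = ols.pvalues.drop("const")
    selected = pvalues[pvalues < 0.05].index.tolist()
    print(pvalues.round(4))
    print("Selected features:", selected)

    # The correlation-matrix check from the question, done numerically:
    # features with correlation near 0 are weak candidates for removal
    corr_with_target = df1.corr()["Target"].drop("Target")
    print(corr_with_target.round(3))
    ```

    The same loop can be repeated after each removal (backward elimination): drop the variable with the largest p-value above .05, refit, and check again.
    
    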