Tags: python, scikit-learn, linear-regression, sparse-matrix, sklearn-pandas

python linear regression: dense vs sparse


I need to use linear regression on a sparse matrix. I have been getting poor results, so I decided to test it on a non-sparse matrix represented sparsely. The data is taken from https://www.analyticsvidhya.com/blog/2021/05/multiple-linear-regression-using-python-and-scikit-learn/.

I have generated max-normalized values for some of the columns. The CSV file is here: https://drive.google.com/file/d/17wHv1Cc3RKgshprIKTcWUSxZOWlG68__/view?usp=sharing
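
For reference, by max-normalization I mean dividing each numeric column by its column maximum so values fall in [0, 1]. A minimal sketch of how that can be produced (the input file name and the choice to normalize every numeric column are assumptions, not necessarily exactly what I did):

import pandas as pd

df = pd.read_csv("50_Startups.csv")  # original dataset from the linked article (assumed file name)
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols] / df[numeric_cols].max()  # divide each column by its max
df.to_csv("maxnorm_50_Startups.csv", index=False)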

Running normal linear regression works fine. Sample code:

df = pd.read_csv("maxnorm_50_Startups.csv")
y = pd.DataFrame()
y = df['Profit']
x = pd.DataFrame()
x = df.drop('Profit', axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
LR = LinearRegression()
LR.fit(x_train, y_train)
y_prediction = LR.predict(x_test)
score=r2_score(y_test, y_prediction)
print('r2 score is', score)

with sample result:

r2 score is 0.9683831928840445

I want to repeat this with a sparse matrix. I convert the CSV to a sparse representation: https://drive.google.com/file/d/1CFWbBbtiSqTSlepGuYXsxa00MSHOj-Vx/view?usp=sharing
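
The conversion is essentially a melt of the dense table into (x, feature, value) triples, one row per cell; something along these lines (a sketch, not necessarily the exact script I used):

import pandas as pd

dense = pd.read_csv("maxnorm_50_Startups.csv")
relational = (dense.reset_index()
                   .melt(id_vars="index", var_name="feature", value_name="value")
                   .rename(columns={"index": "x"}))  # 'x' is the original row number
relational.to_csv("maxnorm_50_Startups_relational.csv", index=False)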

Here is my code to do linear regression on it:

df = pd.read_csv("maxnorm_50_Startups_relational.csv")
df['x'] = pd.to_numeric(df['x'], errors='raise')

m = len(df.x.unique())

for i in range(0, m): # randomize the 'x' values to randomize train test split
    n = random.randint(0, m)
    df.loc[df['x'] == n, 'x'] = m
    df.loc[df['x'] == i, 'x'] = n
    df.loc[df['x'] == m, 'x'] = i
    
y = pd.DataFrame()
y = df[df['feature'] == 'Profit']
x = pd.DataFrame()
x = df[df['feature'] != 'Profit']
    
y = y.drop('feature', axis=1)

x['feat'] = pd.factorize(x['feature'])[0] # sparse matrix code below can't work with strings

x_train = pd.DataFrame()
x_train = x[x['x'] <= 39]
x_test = pd.DataFrame()
x_test = x[x['x'] >= 40]

y_train = pd.DataFrame()
y_train = y[y['x'] <= 39]
y_test = pd.DataFrame()
y_test = y[y['x'] >= 40]

x_test['x'] = x_test['x'] - 40 # sparse matrix assumes that if something is numbered 50
y_test['x'] = y_test['x'] - 40 # there must be 50 records. there are 10. so renumber to 10

x_train_sparse = scipy.sparse.coo_matrix((x_train.value, (x_train.x, x_train.feat)))
# print(x_train_sparse.todense())
x_test_sparse = scipy.sparse.coo_matrix((x_test.value, (x_test.x, x_test.feat)))
LR = LinearRegression()
LR.fit(x_train_sparse, y_train)
y_prediction = LR.predict(x_test_sparse)
score = r2_score(y_test, y_prediction)
print('r2 score is', score)

Running this, I get negative R2 scores, such as:

r2 score is -10.794519939249602

meaning the linear regression is not working. I don't know where I am going wrong. I tried implementing the linear regression equations myself instead of using the library functions, and I still get a negative r2 score. What is my mistake?
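
(By "implementing the linear regression equations myself" I mean an ordinary least-squares fit via the normal equations, roughly like the sketch below on the dense x_train/y_train from the first snippet; this is illustrative, not my exact code.)

import numpy as np

# ordinary least squares with an intercept column: w = (X^T X)^{-1} X^T y
X = np.column_stack([np.ones(len(x_train)), x_train.to_numpy()])
w = np.linalg.solve(X.T @ X, X.T @ y_train.to_numpy())
X_test = np.column_stack([np.ones(len(x_test)), x_test.to_numpy()])
y_prediction = X_test @ w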


Solution

  • Linear Regression performs poorly on sparse data.

    There are other linear algorithms such as Ridge, Lasso, BayesianRidge and ElasticNet that perform equally well on both dense and sparse data. These algorithms are similar to linear regression, but their loss function contains an extra penalty term.

    There are also non-linear algorithms such as RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, XGBRegressor, etc. that perform equally well on sparse and dense matrices.

    I would recommend using these algorithms rather than plain linear regression; see the Ridge sketch after this list for an example.
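
For example, Ridge accepts the same scipy sparse matrices directly. A minimal sketch, reusing x_train_sparse / x_test_sparse from your code and assuming the Profit values in the 'value' column are the target:

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

ridge = Ridge(alpha=1.0)                     # L2-penalized linear model, works on sparse input
ridge.fit(x_train_sparse, y_train['value'])  # target: the Profit values
y_prediction = ridge.predict(x_test_sparse)
print('r2 score is', r2_score(y_test['value'], y_prediction))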