I need to use linear regression on a sparse matrix. I have been getting poor results, so I decided to test it on a non-sparse matrix represented sparsely. The data is taken from https://www.analyticsvidhya.com/blog/2021/05/multiple-linear-regression-using-python-and-scikit-learn/.
I have generated max-normalized values for some of the columns. The CSV file is here: https://drive.google.com/file/d/17wHv1Cc3RKgshprIKTcWUSxZOWlG68__/view?usp=sharing
Running normal linear regression works fine. Sample code:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("maxnorm_50_Startups.csv")
y = df['Profit']
x = df.drop('Profit', axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
LR = LinearRegression()
LR.fit(x_train, y_train)
y_prediction = LR.predict(x_test)
score = r2_score(y_test, y_prediction)
print('r2 score is', score)
with sample result:
r2 score is 0.9683831928840445
I want to repeat this with a sparse matrix. I convert the CSV to a sparse representation: https://drive.google.com/file/d/1CFWbBbtiSqTSlepGuYXsxa00MSHOj-Vx/view?usp=sharing
Here is my code to do linear regression on it:
import random

import pandas as pd
import scipy.sparse
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("maxnorm_50_Startups_relational.csv")
df['x'] = pd.to_numeric(df['x'], errors='raise')
m = len(df.x.unique())
for i in range(0, m):  # randomize the 'x' values to randomize the train/test split
    n = random.randint(0, m)
    df.loc[df['x'] == n, 'x'] = m
    df.loc[df['x'] == i, 'x'] = n
    df.loc[df['x'] == m, 'x'] = i
y = df[df['feature'] == 'Profit']
x = df[df['feature'] != 'Profit']
y = y.drop('feature', axis=1)
x['feat'] = pd.factorize(x['feature'])[0]  # sparse matrix code below can't work with strings
x_train = x[x['x'] <= 39]
x_test = x[x['x'] >= 40]
y_train = y[y['x'] <= 39]
y_test = y[y['x'] >= 40]
x_test['x'] = x_test['x'] - 40  # the sparse matrix infers its shape from the largest index,
y_test['x'] = y_test['x'] - 40  # so renumber the 10 test rows from 40-49 down to 0-9
x_train_sparse = scipy.sparse.coo_matrix((x_train.value, (x_train.x, x_train.feat)))
# print(x_train_sparse.todense())
x_test_sparse = scipy.sparse.coo_matrix((x_test.value, (x_test.x, x_test.feat)))
LR = LinearRegression()
LR.fit(x_train_sparse, y_train)
y_prediction = LR.predict(x_test_sparse)
score = r2_score(y_test, y_prediction)
print('r2 score is', score)
Running this, I get negative R2 scores, such as:
r2 score is -10.794519939249602
meaning the linear regression is not working. I don't know where I am going wrong. I tried implementing the linear regression equations myself instead of using the library functions, and I still get a negative r2 score. What is my mistake?
Linear Regression performs poorly on sparse data. There are other linear algorithms like Ridge, Lasso, BayesianRidge, and ElasticNet that perform equally well on both dense and sparse data. These algorithms are similar to linear regression, but their loss function contains an extra penalty term.
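As a minimal sketch of what that looks like in practice (using synthetic data, not the question's CSV), scikit-learn's Ridge accepts a SciPy sparse matrix directly in fit and predict:

```python
import numpy as np
import scipy.sparse
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Synthetic data with a known, almost-exact linear relationship
X = rng.random((50, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + 0.01 * rng.random(50)

# Store the same data sparsely; Ridge works on sparse input as-is
X_sparse = scipy.sparse.csr_matrix(X)
model = Ridge(alpha=1e-3)
model.fit(X_sparse[:40], y[:40])          # first 40 rows as training data
print(r2_score(y[40:], model.predict(X_sparse[40:])))
```

Because the synthetic target is nearly linear, the holdout r2 here comes out close to 1.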
There are also non-linear algorithms like RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, XGBRegressor, etc. that perform equally well on sparse and dense matrices.
I would recommend using these algorithms rather than simple linear regression.