I want to complete my logistic regression model, which predicts the season of the year from the store name and purchase category (see below for sample data, and note the label encoding). The store name is any typical string, while the category (tops, for example) is one of a fixed set of uniform string values. The same goes for the four seasons.
store_df.head()
   shop  category  season
0   594         4       2
1   644         4       2
2   636         4       2
3   675         5       2
4   644         4       0
My full code is below, and I'm unsure why it's not accepting the shape of my input values. My aim is to leverage shop and category to predict the season.
predict_df = store_df[['shop', 'category', 'season']]
predict_df.reset_index(drop = True, inplace = True)
le = LabelEncoder()
predict_df['shop'] = le.fit_transform(predict_df['shop'].astype('category'))
predict_df['top'] = le.fit_transform(predict_df['top'].astype('category'))
predict_df['season'] = le.fit_transform(predict_df['season'].astype('category'))
X, y = predict_df[['shop', 'top']], predict_df['season']
xtrain, ytrain, xtest, ytest = train_test_split(X, y, test_size=0.2)
lr = LogisticRegression(class_weight='balanced', fit_intercept=False, multi_class='multinomial', random_state=10)
lr.fit(xtrain, ytrain)
When I run the above, I hit this error: ValueError: bad input shape (19405, 2)
My interpretation is that it has to do with two feature inputs, but what would I need to change to be able to use both features?
Here is a working example that you can compare your code against to track down any bugs. I have added a few rows to the data frame; the details and the results follow the code. As you can see, the model predicted three out of the four test labels correctly.
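import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Rebuild the label-encoded data frame that is printed after the code
df = pd.DataFrame({'shop': [594, 644, 636, 675, 644, 642, 638, 466, 455, 643],
                   'category': [4, 4, 4, 5, 4, 2, 1, 3, 4, 2],
                   'season': [2, 2, 2, 2, 0, 1, 1, 0, 0, 1]})

le = LabelEncoder()
sc = StandardScaler()

# Features are the first two columns, the target is the last column.
# get_dummies only encodes object/category columns, so these integer
# columns pass through unchanged.
X = pd.get_dummies(df.iloc[:, :2], drop_first=True).values.astype('float')
y = le.fit_transform(df.iloc[:, -1].values).astype('float')

# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Fit the scaler on the training set only, then apply it to the test set
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)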
df
Out[32]:
   shop  category  season
0   594         4       2
1   644         4       2
2   636         4       2
3   675         5       2
4   644         4       0
5   642         2       1
6   638         1       1
7   466         3       0
8   455         4       0
9   643         2       1
y_test
Out[33]: array([2., 0., 0., 1.])
y_pred
Out[34]: array([2., 0., 2., 1.])
conf_mat
Out[35]:
array([[1, 0, 1],
       [0, 1, 0],
       [0, 0, 1]], dtype=int64)
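Comparing the question's code with the example above, one difference stands out and is a likely cause of the shape error: the question unpacks the result of train_test_split as xtrain, ytrain, xtest, ytest, whereas the function returns X_train, X_test, y_train, y_test. With the question's order, the name ytrain ends up holding the test features, a two-column array, which would explain a "bad input shape (n, 2)" complaint about the labels. Using two features is not the problem in itself. The sketch below illustrates this on synthetic data; the array sizes and random values are assumptions for illustration only, not the asker's data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the label-encoded data (values are made up)
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(0, 50, size=200),   # 'shop' codes
                     rng.integers(0, 6, size=200)])   # 'category' codes
y = rng.integers(0, 4, size=200)                      # 'season' codes

# Correct order: both splits of X first, then both splits of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The question's order binds the test features to the name "ytrain",
# so fit() would receive a 2-D array where it expects 1-D labels
xtrain, ytrain, xtest, ytest = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(ytrain.shape)  # (40, 2)

# With the correct unpacking, both features are accepted without changes
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print(lr.predict(X_test).shape)  # (40,)
```

If this is indeed the cause, swapping the two middle names in the question's train_test_split line should be enough for lr.fit to accept both features.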