python encoding logistic-regression multiclass-classification

python - multi class logistic regression to predict season

I want to complete my logistic regression algorithm, which predicts the season of the year from the store name and purchase category (see the sample data below, and note the label encoding). The store name can be any string, while the category (for example, tops) is one of a fixed set of strings. The same applies to the four seasons.

store_df.head()

        shop    category    season
    0   594     4           2
    1   644     4           2
    2   636     4           2
    3   675     5           2
    4   644     4           0

My full code is below, and I'm unsure why it's not accepting the shape of my input values. My aim is to leverage shop and category to predict the season.

predict_df = store_df[['shop', 'category', 'season']]
predict_df.reset_index(drop = True, inplace = True)
le = LabelEncoder()
predict_df['shop'] = le.fit_transform(predict_df['shop'].astype('category'))
predict_df['top'] = le.fit_transform(predict_df['top'].astype('category'))
predict_df['season'] = le.fit_transform(predict_df['season'].astype('category'))
X, y = predict_df[['shop', 'top']], predict_df['season']
xtrain, ytrain, xtest, ytest = train_test_split(X, y, test_size=0.2)
lr = LogisticRegression(class_weight='balanced', fit_intercept=False, multi_class='multinomial', random_state=10)
lr.fit(xtrain, ytrain)

When I run the above, I get this error: ValueError: bad input shape (19405, 2)

My interpretation is that it has to do with two feature inputs, but what would I need to change to be able to use both features?

Solution

Here is a working example that you can compare against your code to track down the bug. I added a few rows to the data frame; the data and the results are shown after the code. As the output shows, the model predicted three of the four test labels correctly.
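One difference worth checking first is the order in which the result of train_test_split is unpacked. The function returns X_train, X_test, y_train, y_test, so the line in the question assigns the two-column test features to ytrain, which would explain an error that mentions a shape of (n, 2). The sketch below reproduces this on synthetic data; the random values and the row count of 200 are assumptions made for illustration, and only the shapes matter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data (assumption: random values,
# 200 rows; only the shapes are relevant here).
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(200, 2))  # two features: shop, category
y = rng.integers(0, 4, size=200)        # four season codes

# Order used in the question: the second name receives the TEST FEATURES.
xtrain, ytrain, xtest, ytest = train_test_split(X, y, test_size=0.2, random_state=0)
wrong_shapes = (xtrain.shape, ytrain.shape)  # ytrain is 2-D here

try:
    LogisticRegression().fit(xtrain, ytrain)
    fit_failed = False
except ValueError:
    # A 2-D target with a mismatched row count is rejected by fit().
    fit_failed = True

# Order actually returned: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
right_shapes = (X_train.shape, y_train.shape)  # y_train is 1-D

# With the names in the right order, both features are accepted.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```

With the corrected order, the target is one-dimensional and fit accepts both feature columns. The exact wording of the error message may differ between scikit-learn versions.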

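import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Sample data: the five rows from the question plus five added rows (printed below).
df = pd.DataFrame({
    'shop':     [594, 644, 636, 675, 644, 642, 638, 466, 455, 643],
    'category': [4, 4, 4, 5, 4, 2, 1, 3, 4, 2],
    'season':   [2, 2, 2, 2, 0, 1, 1, 0, 0, 1],
})

le = LabelEncoder()
sc = StandardScaler()

# Features: the first two columns (shop, category). By default get_dummies expands
# only object, string, or category columns, so these integer codes pass through unchanged.
X = pd.get_dummies(df.iloc[:, :2], drop_first=True).values.astype('float')
# Target: the last column (season), as a 1-D array.
y = le.fit_transform(df.iloc[:, -1].values).astype('float')

# train_test_split returns X_train, X_test, y_train, y_test, in that order.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
# Fit the scaler on the training split only, then apply it to the test split.
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Rows are true labels, columns are predicted labels.
conf_mat = confusion_matrix(y_test, y_pred)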

df
Out[32]: 
   shop  category  season
0   594         4       2
1   644         4       2
2   636         4       2
3   675         5       2
4   644         4       0
5   642         2       1
6   638         1       1
7   466         3       0
8   455         4       0
9   643         2       1

y_test
Out[33]: array([2., 0., 0., 1.])

y_pred
Out[34]: array([2., 0., 2., 1.])

conf_mat
Out[35]: 
array([[1, 0, 1],
       [0, 1, 0],
       [0, 0, 1]], dtype=int64)
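
To see where the three-out-of-four figure comes from, the confusion matrix can be read directly: rows are true labels, columns are predicted labels, and the diagonal counts the correct predictions. The short sketch below copies the y_test and y_pred arrays from the output above and recomputes the accuracy; the variable names correct, total and accuracy are mine, not part of the original code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Arrays copied from the output above.
y_test = np.array([2., 0., 0., 1.])
y_pred = np.array([2., 0., 2., 1.])

conf_mat = confusion_matrix(y_test, y_pred)

# Diagonal entries are correct predictions; the sum of all entries
# is the number of test samples.
correct = np.trace(conf_mat)
total = conf_mat.sum()
accuracy = correct / total

# The same number via the built-in helper.
accuracy_check = accuracy_score(y_test, y_pred)
print(correct, total, accuracy)
```

The single off-diagonal entry, in row 0 and column 2, is the one test sample whose true season is 0 but which was predicted as 2.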