python, encoding, logistic-regression, multiclass-classification

python - multi class logistic regression to predict season


I want to complete my logistic regression algorithm, which predicts the season of the year from the store name and the purchase category (see the sample data below, and note the label encoding). The store name can be any typical string, while the category (tops, for example) is one of a fixed set of string values. The same goes for the four seasons.

store_df.head()

        shop    category    season
    0   594     4           2
    1   644     4           2
    2   636     4           2
    3   675     5           2
    4   644     4           0

My full code is below, and I'm unsure why it's not accepting the shape of my input values. My aim is to leverage shop and category to predict the season.

predict_df = store_df[['shop', 'category', 'season']]
predict_df.reset_index(drop = True, inplace = True)
le = LabelEncoder()
predict_df['shop'] = le.fit_transform(predict_df['shop'].astype('category'))
predict_df['top'] = le.fit_transform(predict_df['top'].astype('category'))
predict_df['season'] = le.fit_transform(predict_df['season'].astype('category'))
X, y = predict_df[['shop', 'top']], predict_df['season']
xtrain, ytrain, xtest, ytest = train_test_split(X, y, test_size=0.2)
lr = LogisticRegression(class_weight='balanced', fit_intercept=False, multi_class='multinomial', random_state=10)
lr.fit(xtrain, ytrain)

When I run the above, I hit the error, ValueError: bad input shape (19405, 2)

My interpretation is that it has to do with two feature inputs, but what would I need to change to be able to use both features?


Solution

  • Here is a working example that you can compare your code against to track down the bugs. I have added a few rows to the data frame; the data and the results are shown after the code. As you can see, the model predicted three of the four test labels correctly.

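    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    
    # Sample data: the five rows from the question plus five added rows
    # (the full frame is printed below).
    df = pd.DataFrame({
        'shop':     [594, 644, 636, 675, 644, 642, 638, 466, 455, 643],
        'category': [4, 4, 4, 5, 4, 2, 1, 3, 4, 2],
        'season':   [2, 2, 2, 2, 0, 1, 1, 0, 0, 1],
    })
    
    le = LabelEncoder()
    sc = StandardScaler()
    
    # Features are the first two columns; the target is the last column.
    # get_dummies only encodes object/category columns, so these integer
    # columns pass through unchanged.
    X = pd.get_dummies(df.iloc[:, :2], drop_first=True).values.astype('float')
    y = le.fit_transform(df.iloc[:, -1].values).astype('float')
    
    # train_test_split returns X_train, X_test, y_train, y_test, in that order.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
    # Fit the scaler on the training split only, then reuse it on the test split.
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
    
    log_reg = LogisticRegression()
    log_reg.fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    
    # Rows are true labels, columns are predicted labels.
    conf_mat = confusion_matrix(y_test, y_pred)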
    
    df
    Out[32]: 
       shop  category  season
    0   594         4       2
    1   644         4       2
    2   636         4       2
    3   675         5       2
    4   644         4       0
    5   642         2       1
    6   638         1       1
    7   466         3       0
    8   455         4       0
    9   643         2       1
    
    y_test
    Out[33]: array([2., 0., 0., 1.])
    
    y_pred
    Out[34]: array([2., 0., 2., 1.])
    
    conf_mat
    Out[35]: 
    array([[1, 0, 1],
           [0, 1, 0],
           [0, 0, 1]], dtype=int64)
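
For comparison with the code in the question: train_test_split returns its results in the order X_train, X_test, y_train, y_test. Unpacking them as xtrain, ytrain, xtest, ytest puts the test features into ytrain, so lr.fit receives a two-column array as the target. That would be consistent with the "bad input shape (19405, 2)" message. Below is a minimal sketch of the question's pipeline with the unpacking order corrected. The store_df here is a hypothetical stand-in with made-up values, I am assuming the second feature column is named category (as in the sample data) rather than top, and I use one LabelEncoder per column so that each mapping can still be inverted later. Apart from class_weight, the LogisticRegression arguments are left at their defaults.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for store_df: raw string columns before encoding.
rng = np.random.default_rng(0)
n = 200
store_df = pd.DataFrame({
    "shop": rng.choice(["north", "south", "east", "west"], size=n),
    "category": rng.choice(["tops", "shoes", "coats"], size=n),
    "season": rng.choice(["winter", "spring", "summer", "autumn"], size=n),
})

# Copy the slice so the column assignments below do not touch store_df.
predict_df = store_df[["shop", "category", "season"]].copy()

# One encoder per column, so each mapping can be inverted later.
encoders = {}
for col in ["shop", "category", "season"]:
    encoders[col] = LabelEncoder()
    predict_df[col] = encoders[col].fit_transform(predict_df[col])

X, y = predict_df[["shop", "category"]], predict_df["season"]

# Order matters: both X splits come first, then both y splits.
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=10
)

lr = LogisticRegression(class_weight="balanced", random_state=10)
lr.fit(xtrain, ytrain)

print(xtrain.shape, ytrain.shape)  # (160, 2) (160,)
print(lr.predict(xtest).shape)     # (40,)
```

With this order, the target passed to fit is one-dimensional and both features are used together. Note that label-encoded shop IDs are treated as ordered numbers by the model; one-hot encoding the string columns is a common alternative if that ordering is not meaningful.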