Tags: python, logistic-regression, hyperparameters

Making my logistic regression testing accuracy closer to my training accuracy with Python


I have a basketball stats data set with 656 features. I am using a logistic regression classifier to predict the winner of a game (team 1 wins or team 2 wins) by subtracting team 1's stats from team 2's stats. Other than normalization, how can I bring the accuracy on my test set closer to the accuracy on my training set, or improve accuracy in general?

I have seen normalization suggested as a possible solution, but since I am working with differences of stats, most of the values already fall in a similar range.

Code:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = final_data_array[:, :656]
Y = final_data_array[:, 656]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

logistic = LogisticRegression(solver='lbfgs', max_iter=4000000,
                              multi_class='multinomial').fit(X_train, Y_train)

print(logistic.score(X_test, Y_test))
print(logistic.score(X_train, Y_train))

Output:

0.7818791946308725   # test accuracy
0.9069506726457399   # training accuracy

Solution

  • The roughly 12-point gap between training accuracy (0.91) and test accuracy (0.78) suggests the model is overfitting. You may try some feature engineering on the dataset; beyond that, normalize the features and check accuracy again. I also recommend trying other classification algorithms, such as XGBClassifier or RandomForestClassifier.
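One standard way to attack an overfitting gap like this is to tune logistic regression's regularization strength `C` with cross-validation, scaling the features inside a Pipeline so the scaler is fit only on each training fold. A minimal sketch follows; since `final_data_array` is not available here, synthetic data from `make_classification` stands in for the real stats array, and the `C` grid is an illustrative choice, not a recommendation from the original answer.

```python
# Sketch: cross-validated tuning of the L2 penalty strength C,
# with feature scaling done inside a Pipeline to avoid leakage.
# Synthetic data stands in for final_data_array (not available here).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, Y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# Smaller C means a stronger penalty, which shrinks the coefficients
# and usually narrows the train/test accuracy gap.
grid = GridSearchCV(pipe, {"clf__C": [0.001, 0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, Y_train)

print("best C:", grid.best_params_["clf__C"])
print("train accuracy:", grid.score(X_train, Y_train))
print("test accuracy:", grid.score(X_test, Y_test))
```

With 656 features, a strong penalty (small `C`) often helps, since many of the stat differences are likely to be redundant or weakly informative.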