I have a basketball stats dataset with 656 features. I am using a logistic regression classifier to predict winners and losers (team 1 wins or team 2 wins), where each feature is team 1's stat minus team 2's stat. Other than normalization, how can I bring my test-set accuracy closer to my training-set accuracy, or improve accuracy in general?
I saw normalization suggested as a possible fix, but since I am taking differences of stats, most of the values are already in the same range.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = final_data_array[:, :656]  # 656 stat-difference features
Y = final_data_array[:, 656]   # label: which team won
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
logistic = LogisticRegression(solver='lbfgs', max_iter=4000000,
                              multi_class='multinomial').fit(X_train, Y_train)
print(logistic.score(X_test, Y_test))    # test accuracy
print(logistic.score(X_train, Y_train))  # training accuracy
Output:
0.7818791946308725
0.9069506726457399
You could try some feature engineering on the dataset, and beyond that, normalize it and check the accuracy again. With 656 features, the gap between your training and test scores suggests overfitting, so stronger regularization (a smaller C in LogisticRegression) should also help. I also recommend trying other classification algorithms, such as XGBClassifier or RandomForestClassifier.
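The suggestions above can be sketched as follows. Since the original final_data_array isn't available here, this uses a synthetic stand-in with the same shape (656 features, binary target); the scaling, the C value, and the forest size are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the basketball data: 656 features, win/loss target
X, y = make_classification(n_samples=1000, n_features=656,
                           n_informative=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling plus stronger L2 regularization (smaller C) to reduce overfitting
logit = make_pipeline(StandardScaler(),
                      LogisticRegression(C=0.01, max_iter=1000))
logit.fit(X_train, y_train)
print("logistic train:", logit.score(X_train, y_train))
print("logistic test: ", logit.score(X_test, y_test))

# Tree ensemble for comparison, as suggested above
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X_train, y_train)
print("forest test:   ", forest.score(X_test, y_test))
```

Comparing the train and test scores before and after adding regularization (try a few values of C, e.g. with GridSearchCV) will show whether the train/test gap actually narrows on your data.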