
Train and test data setup for sklearn


I'm creating a classification model to predict the outcome of a sports event (win/loss) and am running into a data setup conundrum. Currently the data is set up as follows:

example_data = [team_a_feat_1, team_a_feat_2, ..., team_b_feat_1, team_b_feat_2, ..., OUTCOME_A_B]

But I'm wondering whether the following would be possible or more logical:

example_data = [[team_a_feat_1, team_a_feat_2, ...],
                [team_b_feat_1, team_b_feat_2, ...],
                OUTCOME_A_B]

Does sklearn allow data to be passed in like this, and if so, would it make a difference to the outcome of the model? I ask because I want the features to be treated as equals between the teams, not as different variables.

Thoughts and suggestions? Am I overthinking this step and does this really affect performance?


Solution

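  • Scikit-learn does not directly support the nested layout in your second example. Its estimators expect the features as a 2D array of shape (n_samples, n_features), so each match has to be a single flat row, as in your first example, with the outcome kept as a separate target.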

    With the flat layout, you can then standardize your features using the StandardScaler from scikit-learn, which rescales each feature column to have a mean of 0 and a standard deviation of 1:

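    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # [:, :-1] slicing needs a 2D NumPy array; the last column is the outcome
    data = np.asarray(example_data, dtype=float)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data[:, :-1])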
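
As a minimal sketch of the flattening step, assuming a hypothetical toy dataset called `matches` in which each entry holds team A's features, team B's features, and the outcome, the nested layout from the question can be turned into one flat row per match before fitting. The feature values, the choice of `LogisticRegression`, and the use of a pipeline are illustrative assumptions rather than part of the answer above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: (team A features, team B features, outcome for A).
matches = [
    ([0.9, 12.0], [0.4, 7.0], 1),
    ([0.3, 5.0], [0.8, 11.0], 0),
    ([0.7, 10.0], [0.5, 8.0], 1),
    ([0.2, 6.0], [0.6, 9.0], 0),
]

# Flatten each nested pair into one row:
# [a_feat_1, a_feat_2, b_feat_1, b_feat_2]
X = np.array([team_a + team_b for team_a, team_b, _ in matches], dtype=float)
# Keep the outcome in a separate 1D target array.
y = np.array([outcome for _, _, outcome in matches])

# Putting the scaler in a pipeline means it is fit only on the data
# passed to fit(), and the same scaling is reused at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
predictions = model.predict(X)
```

Here `X` has one row per match and one column per team-specific feature, so the model still sees team A's and team B's features as separate columns. The nested form only changes how the data is stored before flattening, not what the estimator receives.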