python machine-learning scikit-learn xgboost

How do I handle a model dataset that has a column of IDs?


I am trying to build a model that predicts the probability of success for NFL Draft prospects. I am having trouble finding a way to print the players' names alongside their corresponding model output. For example, it currently prints something like this: "[79 22 36 72 20 48 2 68 16 36 11 68 68 16 22 17 60 62 15 17 11 68 0 84 28 22 45 48 79 84 2 37 68]". I would like the player associated with each output to be printed as well. I am working from some template code I found online for the type of model I would like to build; it is posted below.

LINK TO DATA: https://docs.google.com/spreadsheets/d/1BQa34rfq7oC3jOO65c4xUqKTuhDGKf46pPwGmjSS3ko/edit?usp=sharing

The "Player" column doesn't matter during training, since this data covers historical drafts going back to 2004. But for the final output, when I ask the model to predict this year's prospects, I need the names in the output as well.

    import pandas as pd
    import xgboost
    from sklearn import model_selection
    from sklearn.metrics import accuracy_score
    from sklearn.preprocessing import LabelEncoder
    
    # load data
    data = pd.read_csv(r"C:\Users\yanke\Documents\NFLDraft\QBDataSet.csv", index_col=0)
    dataset = data
    
    # split data into X and y
    X = dataset.iloc[:,0:4]
    Y = dataset.iloc[:,4]
    # encode string class values as integers
    label_encoder = LabelEncoder()
    label_encoder = label_encoder.fit(Y)
    label_encoded_y = label_encoder.transform(Y)
    
    seed = 7
    test_size = 0.33
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, label_encoded_y, test_size=test_size, random_state=seed)
    
    # fit model on training data
    model = xgboost.XGBClassifier()
    model.fit(X_train, y_train)
    print(model)
    
    # make predictions for test data
    y_pred = model.predict(X_test)
    predictions = [round(value) for value in y_pred]
    
    # evaluate predictions
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
    print(y_pred)

Solution

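  • Will this work? Because the CSV is read with index_col=0, the player names become the index of X_test, in the same order as the predictions: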

    for player, prediction in zip(X_test.index, predictions):
      print(player, prediction)
    

    Output:

    Colin Kaepernick 3
    Jeff Driskel 2
    Dwayne Haskins 1
    Colt McCoy 1
    Ryan Lindley 2
    Jameis Winston 2
    Sam Darnold 1
    Sam Bradford 1
    Troy Smith 1
    Johnny Manziel 1
    Matthew Stafford 3
    Kyler Murray 2
    Daniel Jones 2
    Gardner Minshew 1
    Joe Webb 2
    Curtis Painter 1
    Andrew Luck 1
    Josh Freeman 2
    Landry Jones 1
    Ryan Finley 1
    Deshaun Watson 1
    Marcus Mariota 1
    Dan Orlovsky 1
    Russell Wilson 2
    Nathan Peterman 1
    Kyle Orton 2
    Paxton Lynch 2
    Alex Smith 1
    Brodie Croyle 1
    Vince Young 2
    Brandon Weeden 1
    Teddy Bridgewater 1
    Brett Hundley 1
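
If it helps, the same idea can be taken one step further by collecting the names and predictions in a DataFrame and decoding the integer codes back to the original labels with `label_encoder.inverse_transform`. The following is only a minimal sketch under some assumptions: the toy data, the column names `feat1`, `feat2` and `Outcome`, and the player names are all made up for illustration, and `DecisionTreeClassifier` is swapped in for `xgboost.XGBClassifier` so the snippet runs without xgboost installed (both follow the scikit-learn `fit`/`predict` interface).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier  # stand-in for xgboost.XGBClassifier

# Hypothetical toy data; the real CSV's columns are not shown in the question.
data = pd.DataFrame(
    {
        "Player": ["QB A", "QB B", "QB C", "QB D", "QB E", "QB F", "QB G", "QB H"],
        "feat1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
        "feat2": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4],
        "Outcome": ["Bust", "Bust", "Bust", "Bust",
                    "Starter", "Starter", "Starter", "Starter"],
    }
).set_index("Player")  # same effect as pd.read_csv(..., index_col=0)

# The names live in the index, so they are never used as a feature.
X = data[["feat1", "feat2"]]
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data["Outcome"])

# train_test_split keeps the index of X, so X_test.index holds the names.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

model = DecisionTreeClassifier(random_state=7)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# One row per test player: the encoded prediction and the decoded label.
results = pd.DataFrame(
    {
        "predicted_code": y_pred,
        "predicted_label": label_encoder.inverse_transform(y_pred),
    },
    index=X_test.index,
)
print(results)
```

The same pattern should carry over to this year's prospects: load them with the player names as the index, call `model.predict` on that frame, and reuse its `.index` when building the results.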