python, python-3.x, machine-learning, scikit-learn, k-fold

How can we include a prediction column in the initial dataset/dataframe after performing K-Fold cross validation?


I would like to run K-fold cross-validation on my data using a classifier, and add the prediction (or predicted probability) for each sample as a column in the initial dataset/dataframe. Any ideas?

from sklearn.metrics import accuracy_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import KFold

k = 5
kf = KFold(n_splits=k, random_state=None)

acc_score = []
auroc_score = []

for train_index , test_index in kf.split(X):
    X_train , X_test = X.iloc[train_index,:],X.iloc[test_index,:]
    y_train , y_test = y[train_index] , y[test_index]

    model.fit(X_train, y_train)
    pred_values = model.predict(X_test)
    # Pass the same DataFrame used for predict() so the input type matches the fit
    predict_prob = model.predict_proba(X_test)[:,1]

    auroc = roc_auc_score(y_test, predict_prob)
    acc = accuracy_score(y_test, pred_values)

    auroc_score.append(auroc)
    acc_score.append(acc)

avg_acc_score = sum(acc_score)/k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
print('AUROC of each fold - {}'.format(auroc_score))
print('Avg AUROC : {}'.format(sum(auroc_score)/k))

Given this code, how can I add a prediction column or, even better, predicted-probability columns for each sample to the initial dataset?

In 10-fold cross-validation, each example (sample) is used exactly once in a test set and 9 times in a training set. So, after 10-fold cross-validation, the result should be a dataframe holding the predicted class for ALL examples in the dataset. Each example would have its initial features, its labelled class, and the class predicted in the cross-validation fold where that example was in the test set.
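
The once-in-test, nine-times-in-train property can be checked directly. This is only a minimal sketch on a toy array of 20 samples (the array and the names `X_toy`, `test_counts` and `train_counts` are illustrative assumptions, not part of the code above):

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20                      # toy size, chosen only for illustration
X_toy = np.zeros((n_samples, 1))    # feature values are irrelevant to the split

test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

# Count how often each sample lands in the test and training parts
for train_index, test_index in KFold(n_splits=10).split(X_toy):
    train_counts[train_index] += 1
    test_counts[test_index] += 1

print(test_counts)   # every entry should be 1
print(train_counts)  # every entry should be 9
```

Because every sample is in a test set exactly once, the out-of-fold predictions cover the whole dataset with no overlaps.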


Solution

  • You can use the .loc method to accomplish this. This question has a nice answer that shows how to use it: df.loc[index_position, "column_name"] = some_value

    So, here is an edited version of the code you posted (I needed data, so I used the breast cancer dataset, and I removed the AUROC calculation since we aren't using probabilities per your edit):

    from sklearn.metrics import accuracy_score
    import pandas as pd
    from sklearn.model_selection import KFold
    from sklearn.datasets import load_breast_cancer
    from sklearn.neural_network import MLPClassifier
    
    X,y = load_breast_cancer(return_X_y=True, as_frame=True)
    model = MLPClassifier()
    
    k = 5
    kf = KFold(n_splits=k, random_state=None)
    
    acc_score = []
    
    # Create the prediction column (placeholder value; every row is overwritten below)
    X['Prediction'] = 1
    
    # Define what values to use for the model
    model_columns = [x for x in X.columns if x != 'Prediction']
    
    for train_index , test_index in kf.split(X):
        X_train , X_test = X.iloc[train_index,:],X.iloc[test_index,:]
        # KFold yields positional indices, so use .iloc on the Series as well
        y_train , y_test = y.iloc[train_index] , y.iloc[test_index]
    
        model.fit(X_train[model_columns], y_train)
        pred_values = model.predict(X_test[model_columns])
    
        acc = accuracy_score(y_test, pred_values)
        acc_score.append(acc)
    
        # Add values to the dataframe. .loc selects by label, so map the
        # positional test_index to index labels first
        X.loc[X.index[test_index], 'Prediction'] = pred_values
    
    avg_acc_score = sum(acc_score)/k
    print('accuracy of each fold - {}'.format(acc_score))
    print('Avg accuracy : {}'.format(avg_acc_score))
    
    # Add label back per question
    X['Label'] = y
    
    # Print first 5 rows to show that it works
    print(X.head(n=5))
    

    Yields

    accuracy of each fold - [0.9210526315789473, 0.9122807017543859, 0.9736842105263158, 0.9649122807017544, 0.8672566371681416]
    Avg accuracy : 0.927837292345909
       mean radius  mean texture  ...  Prediction  Label
    0        17.99         10.38  ...           0      0
    1        20.57         17.77  ...           0      0
    2        19.69         21.25  ...           0      0
    3        11.42         20.38  ...           1      0
    4        20.29         14.34  ...           0      0
    
    [5 rows x 32 columns]
    

    (Obviously the model, values, etc. are all arbitrary.)
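
Since the question also asks about predicted probabilities, here is one possible sketch using scikit-learn's `cross_val_predict` with `method="predict_proba"`, which returns each sample's out-of-fold prediction in the original row order. The scaled logistic regression is only a stand-in model, and the names `result`, `Prob_0` and `Prob_1` are assumptions made for illustration:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Stand-in model (assumption): any classifier exposing predict_proba would do
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kf = KFold(n_splits=5)

# One row per sample, one column per class; each row comes from the fold
# in which that sample was part of the test set
proba = cross_val_predict(model, X, y, cv=kf, method="predict_proba")

# Work on a copy so the feature frame itself stays untouched
result = X.copy()
result["Label"] = y
result["Prob_0"] = proba[:, 0]
result["Prob_1"] = proba[:, 1]
# Classes are 0 and 1 here, so the argmax column index equals the class label
result["Prediction"] = proba.argmax(axis=1)

print(result[["Label", "Prob_0", "Prob_1", "Prediction"]].head())
```

Keeping the predictions in a copy avoids having to exclude the new columns from the features on every fold, at the cost of not seeing the per-fold scores that the explicit loop prints.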