Tags: python, machine-learning, linear-regression, cross-validation, k-fold

How to plot the data and model fit for each fold after kfold cross validation?


I am trying to predict one label variable from a single feature. The two seem to be highly linearly correlated, so I chose a linear regression model to describe the data. My code prints the R2 score for the training and testing data of each fold. The model performs well, except for one fold where the test R2 is negative. I would like to plot the data of each fold together with the model fit, to get an idea of what is going wrong. However, I could not figure out how to do this in Python.

Could anyone help?


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Feature_X and label_X are assumed to be NumPy arrays defined earlier (not shown).
Test_scores = list()
Train_scores = list()
n_splits = 5
kfold = KFold(n_splits=n_splits, shuffle=False)
for train_ix, test_ix in kfold.split(Feature_X):
    # Split features and labels into the train and test parts of this fold.
    Train_Feature_X, Test_Feature_X = Feature_X[train_ix], Feature_X[test_ix]
    Train_label_X, Test_label_X = label_X[train_ix], label_X[test_ix]
    # Fit a fresh model on the training part only.
    model = LinearRegression()
    model.fit(Train_Feature_X, Train_label_X)
    # R2 on the training part and on the held-out part.
    pred_label_train = model.predict(Train_Feature_X)
    acc_train = r2_score(Train_label_X, pred_label_train)
    pred_label_test = model.predict(Test_Feature_X)
    acc_test = r2_score(Test_label_X, pred_label_test)
    Test_scores.append(acc_test)
    Train_scores.append(acc_train)
    print('> ', 'Train:'+ str(acc_train), "Test:"+ str(acc_test))
Test_mean, Test_std = np.mean(Test_scores), np.std(Test_scores)
Train_mean, Train_std = np.mean(Train_scores), np.std(Train_scores)

print('Mean of test: %.3f, Standard Deviation: %.3f' % (Test_mean, Test_std))
print('Mean of train: %.3f, Standard Deviation: %.3f' % (Train_mean, Train_std))



Output of the code:

>  Train:0.9948113361306588 Test:0.9715872368615199
>  Train:0.9905854864161807 Test:0.9917503220348162
>  Train:0.9888929852977923 Test:-4.996610921978263
>  Train:0.990942242777374 Test:0.9960355777732937
>  Train:0.9925744355834707 Test:0.9458246438971184
Mean of test: -0.218, Standard Deviation: 2.389
Mean of train: 0.992, Standard Deviation: 0.002



Solution

  • You can add the plotting directly inside the loop.

    In each iteration you have access to the train and test parts of the fold and to the predictions, so just before the print(...) call that reports the two scores you can do something like:

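    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(nrows=1, ncols=n_splits, figsize=(4 * n_splits, 4))
    curr_split = 0
    for train_ix, test_ix in kfold.split(Feature_X):
        # ... split, fit and predict exactly as in your loop ...
        ax = axes[curr_split]  # one subplot per fold
        ax.scatter(Test_Feature_X, Test_label_X, label='test data')
        ax.plot(Test_Feature_X, pred_label_test, color='red', label='model fit')
        ax.set_title('Fold %d, R2 = %.3f' % (curr_split + 1, acc_test))
        curr_split += 1
    plt.show()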
    

    This draws 5 subplots, one for each fold.

    Note that this is only a general outline of what you should do; for the details, refer to the matplotlib documentation for plt.subplots().
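
To make the outline above concrete, here is a minimal self-contained sketch. It keeps the question's names Feature_X and label_X but fills them with synthetic data, since the real data set is not shown. The helper plot_folds, the noise level and the figure size are assumptions made for illustration, not part of the original code. Feature_X is assumed to be a 2-D array with a single column.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; remove this line to open a window with plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def plot_folds(X, y, n_splits=5):
    """Fit one LinearRegression per fold and draw each fold in its own subplot.

    Hypothetical helper: X must have shape (n_samples, 1), y shape (n_samples,).
    Returns the figure and the list of test R2 scores, one per fold.
    """
    kfold = KFold(n_splits=n_splits, shuffle=False)
    fig, axes = plt.subplots(nrows=1, ncols=n_splits, figsize=(4 * n_splits, 4))
    test_scores = []
    # Line endpoints spanning the whole feature range, so every panel shares the same x extent.
    x_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
    for ax, (train_ix, test_ix) in zip(axes, kfold.split(X)):
        model = LinearRegression().fit(X[train_ix], y[train_ix])
        score = r2_score(y[test_ix], model.predict(X[test_ix]))
        test_scores.append(score)

        # Training points, held-out points, and the line fitted on the training part.
        ax.scatter(X[train_ix, 0], y[train_ix], s=8, alpha=0.4, label="train")
        ax.scatter(X[test_ix, 0], y[test_ix], s=8, color="tab:red", label="test")
        ax.plot(x_line[:, 0], model.predict(x_line), color="k", label="fit")
        ax.set_title("Test R2: %.3f" % score)
    axes[0].legend()
    fig.tight_layout()
    return fig, test_scores


# Synthetic stand-in data (an assumption, not the asker's data set).
rng = np.random.default_rng(0)
Feature_X = rng.uniform(0, 10, size=200).reshape(-1, 1)
label_X = 2.0 * Feature_X[:, 0] + 1.0 + rng.normal(scale=0.5, size=200)

fig, test_scores = plot_folds(Feature_X, label_X, n_splits=5)
```

With your own arrays, call plot_folds(Feature_X, label_X) and then either fig.savefig("folds.png") or, with an interactive backend, plt.show(). Drawing the training and test points in different colours makes it easy to see whether the fold with the negative R2 covers a different region of the data than the others.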