python, machine-learning, scikit-learn, cross-validation

I could not understand the difference between cross_val_score and accuracy_score


I am trying to understand the difference between the cross-validation score and the accuracy score. I got an accuracy score of 0.79 and a cross-validation score of 0.73. As far as I know, these scores should be very close to each other. What can I say about my model just by looking at these scores?

sonar_x = df_2.iloc[:,0:61].values.astype(int)
sonar_y = df_2.iloc[:,62:].values.ravel().astype(int)

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

x_train,x_test,y_train,y_test=train_test_split(sonar_x,sonar_y,test_size=0.33,random_state=0)

rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

folds = KFold(n_splits = 10, shuffle = False, random_state = 0)
scores = []

for n_fold, (train_index, valid_index) in enumerate(folds.split(sonar_x,sonar_y)):
    print('\n Fold '+ str(n_fold+1 ) + 
          ' \n\n train ids :' +  str(train_index) +
          ' \n\n validation ids :' +  str(valid_index))
    
    x_train, x_valid = sonar_x[train_index], sonar_x[valid_index]
    y_train, y_valid = sonar_y[train_index], sonar_y[valid_index]
    
    rf.fit(x_train, y_train)
    y_pred = rf.predict(x_test)
    
    
    acc_score = accuracy_score(y_test, y_pred)
    scores.append(acc_score)
    print('\n Accuracy score for Fold ' +str(n_fold+1) + ' --> ' + str(acc_score)+'\n')

    
print(scores)
print('Avg. accuracy score :' + str(np.mean(scores)))


##Cross validation score 
scores = cross_val_score(rf, sonar_x, sonar_y, cv=10)

print(scores.mean())


Solution

  • You have a bug in your code that accounts for the gap: you are training on each fold's training split, but evaluating against a fixed test set.

    These two lines in the for loop:

    y_pred = rf.predict(x_test)
    
    acc_score = accuracy_score(y_test, y_pred)
    

    Should be:

    y_pred = rf.predict(x_valid)
    acc_score = accuracy_score(y_valid, y_pred)
    

    Since your hand-written cross-validation evaluates every fold against the same fixed x_test and y_test, for some folds those test samples also appear in that fold's training split; that leakage accounts for the overly optimistic overall average.

    If you correct this, the values should come closer, because you are then doing, conceptually speaking, the same thing cross_val_score does.
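
    For reference, here is a minimal sketch of the whole corrected loop, reusing sonar_x, sonar_y, rf and folds exactly as defined in your code:

    scores = []
    for n_fold, (train_index, valid_index) in enumerate(folds.split(sonar_x, sonar_y)):
        # split the data for this fold
        x_train, x_valid = sonar_x[train_index], sonar_x[valid_index]
        y_train, y_valid = sonar_y[train_index], sonar_y[valid_index]

        rf.fit(x_train, y_train)
        # evaluate on this fold's held-out split, not on the fixed test set
        y_pred = rf.predict(x_valid)
        scores.append(accuracy_score(y_valid, y_pred))

    print('Avg. accuracy score:', np.mean(scores))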

    They might not match exactly though, due to randomness and the size of your dataset, which is quite small.
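
    One way to make the two numbers directly comparable is to pass the same splitter to cross_val_score, so both approaches score exactly the same folds. A sketch, reusing the objects above (note that the classifier itself is still random unless you also fix its random_state):

    cv = KFold(n_splits=10, shuffle=False)
    scores = cross_val_score(rf, sonar_x, sonar_y, cv=cv, scoring='accuracy')
    print(scores.mean())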

    Finally, if you just wanted to get a single test score, the KFold part is not needed and you can do:

    x_train,x_test,y_train,y_test=train_test_split(sonar_x,sonar_y,test_size=0.33,random_state=0)
    rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
    rf.fit(x_train, y_train)  
    y_pred = rf.predict(x_test)    
    acc_score = accuracy_score(y_test, y_pred)
    

    This result is less robust than the cross-validated one, since you split the dataset only once, so you can get better or worse results by chance, that is, depending on how easy or hard the particular train-test split produced by the random seed happens to be.
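
    To get a feel for how much a single split can vary, you can also look at the spread of the per-fold scores rather than just their mean (a sketch, again reusing rf, sonar_x and sonar_y from above):

    scores = cross_val_score(rf, sonar_x, sonar_y, cv=10)
    print('per-fold scores:', scores)
    print('mean: %.3f, std: %.3f' % (scores.mean(), scores.std()))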