I have a binary classification problem
First I train/test split my data:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
I checked y_train and it had basically a 50/50 split of the two classes (1, 0), which is how the dataset is.
When I try a base model such as:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_train, y_train)
the output is 0.98, or within about 1% of that, depending on the random state of the train/test split.
HOWEVER, when I try a cross_val_score such as:
from sklearn.model_selection import cross_val_score, StratifiedKFold

cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='accuracy')
the output is
array([0.65      , 0.78333333, 0.78333333, 0.66666667, 0.76666667])
None of the scores in the array are even close to 0.98.
And when I tried scoring='r2' I got:
>>>cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='r2')
array([-0.20133482, -0.00111235, -0.2 , -0.2 , -0.13333333])
Does anyone know why this is happening? I have tried shuffle=True and shuffle=False but it doesn't help.
Thanks in advance
In your base model, you compute your score on the training set. While that confirms your model has actually fit the data you fed it, it says nothing about the model's accuracy on new, unseen data.
I'm not 100% sure (I don't know scikit-learn well), but I'd expect cross_val_score to split X_train and y_train themselves into a training and a testing part for each fold.
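If I understand it correctly, it is roughly equivalent to this sketch (assuming X_train and y_train are NumPy arrays; use .iloc if they are pandas objects):

from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# rough sketch of what cross_val_score does with cv=StratifiedKFold(shuffle=True)
scores = []
for train_idx, test_idx in StratifiedKFold(shuffle=True).split(X_train, y_train):
    fold_model = clone(model)  # fresh, unfitted copy of the estimator
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    # score on the held-out fold: data this copy never saw during fitting
    scores.append(fold_model.score(X_train[test_idx], y_train[test_idx]))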
Since the score is then computed on data unseen during training, the accuracy will be much lower. Try comparing these results with model.score(X_test, y_test); it should be much closer.
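For example, reusing the names from your question, something like:

# train score: data the forest has effectively memorized
print(model.score(X_train, y_train))
# test score: data held out by train_test_split
print(model.score(X_test, y_test))
# mean held-out accuracy across the cross-validation folds
print(cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='accuracy').mean())

I'd expect the last two numbers to be in the same ballpark, well below the training score.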