I have a binary classification problem
First I train/test split my data:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
I checked y_train and it had basically a 50/50 split of the two classes (1, 0), which is how the dataset is.
When I try a base model such as:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_train, y_train)
the output is 0.98, or within about 1% of that, depending on the random state of the train/test split.
HOWEVER, when I try a cross_val_score such as:
from sklearn.model_selection import cross_val_score, StratifiedKFold

cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='accuracy')
the output is
array([0.65      , 0.78333333, 0.78333333, 0.66666667, 0.76666667])
None of the scores in the array are even close to 0.98.
And when I tried scoring='r2' I got:
>>>cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='r2')
array([-0.20133482, -0.00111235, -0.2 , -0.2 , -0.13333333])
Does anyone know why this is happening? I have tried shuffle=True and shuffle=False but it doesn't help.
Thanks in advance
In your base model, you compute your score on the training set. While that confirms your model has actually fit the data you fed it, it says nothing about the model's accuracy on new, unseen data.
I'm not 100% sure (I don't know scikit-learn well), but I'd expect cross_val_score to split X_train and y_train themselves into a training and a testing part for each fold.
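If I understand it correctly, it is roughly equivalent to this sketch (assuming X_train and y_train are NumPy arrays; use .iloc if they are pandas objects):

from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# rough sketch of what cross_val_score does with cv=StratifiedKFold(shuffle=True)
scores = []
for train_idx, test_idx in StratifiedKFold(shuffle=True).split(X_train, y_train):
    fold_model = clone(model)  # fresh, unfitted copy of the estimator
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    # score on the held-out fold: data this copy never saw during fitting
    scores.append(fold_model.score(X_train[test_idx], y_train[test_idx]))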
Since the score is then computed on data unseen during training, the accuracy will be much lower. Try comparing these results with model.score(X_test, y_test); it should be much closer.
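For example, reusing the names from your question, something like:

# train score: data the forest has effectively memorized
print(model.score(X_train, y_train))
# test score: data held out by train_test_split
print(model.score(X_test, y_test))
# mean held-out accuracy across the cross-validation folds
print(cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='accuracy').mean())

I'd expect the last two numbers to be in the same ballpark, well below the training score.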