Tags: python, machine-learning, scikit-learn, cross-validation

Should Cross Validation Score be performed on original or split data?


When I want to evaluate my model with cross validation, should I perform cross validation on the original data (that is, data not split into train and test sets) or on the train/test data?

I know that training data is used for fitting the model, and testing for evaluating. If I use cross validation, should I still split the data into train and test, or not?

features = df.iloc[:,4:-1]
results = df.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

clf = LogisticRegression()
model = clf.fit(x_train, y_train)

accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)

Or should I do like this:

features = df.iloc[:,4:-1]
results = df.iloc[:,-1]

clf = LogisticRegression()
model = clf.fit(features, results)

accuracy_test = cross_val_score(clf, features, results, cv=5)

Or maybe something different?


Solution

  • Both your approaches are wrong.

    • In the first one, you apply cross validation to the test set alone, which is meaningless: the folds are fitted and scored within the (small) test set, and your training data is never used

    • In the second one, you first fit the model on your whole data and then perform cross validation on that same data, which is again meaningless. Moreover, the approach is redundant: your fitted clf is not used by cross_val_score, which clones the estimator and does its own fitting on each fold

    Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance assessment), there are two ways:

    • Either with a separate test set
    • Or with cross validation
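
    Regarding the redundancy point above: cross_val_score clones the estimator and fits the clone on each training fold, so a prior clf.fit has no effect on the returned scores. A minimal sketch (using make_classification toy data as a stand-in for your features/results):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # toy data standing in for features / results
    X, y = make_classification(n_samples=200, random_state=0)

    clf = LogisticRegression(max_iter=1000)
    # clf has never been fitted, yet cross_val_score works:
    # it clones clf and fits the clone on each training fold internally
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print(scores)  # one accuracy per fold
    ```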

    First way (test set):

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    
    x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
    
    clf = LogisticRegression()
    model = clf.fit(x_train, y_train)
    
    y_pred = clf.predict(x_test)
    
    accuracy_test = accuracy_score(y_test, y_pred)
    

    Second way (cross validation):

    from sklearn.model_selection import cross_val_score
    from sklearn.utils import shuffle
    
    clf = LogisticRegression()
    
    # shuffle the data first (cross_val_score / KFold does not shuffle by default):
    features_s, results_s = shuffle(features, results)
    accuracy_cv = cross_val_score(clf, features_s, results_s, cv=5, scoring='accuracy')
    
    # fit the model afterwards with the whole data, if satisfied with the performance:
    model = clf.fit(features, results)
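
    cross_val_score returns one score per fold, so a common way to report the result is the mean and standard deviation across folds. A self-contained sketch (again using make_classification toy data in place of your features/results):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.utils import shuffle

    # toy data standing in for features / results
    X, y = make_classification(n_samples=200, random_state=0)
    X_s, y_s = shuffle(X, y, random_state=0)

    accuracy_cv = cross_val_score(LogisticRegression(max_iter=1000),
                                  X_s, y_s, cv=5, scoring='accuracy')
    # summarize the 5 fold scores as mean +/- std
    print("CV accuracy: %.3f +/- %.3f" % (accuracy_cv.mean(), accuracy_cv.std()))
    ```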