
How to remove cross-validation with train_test_split?


My code:

from sklearn.model_selection import train_test_split

X = data['text_with_tokeniz_lemmatiz']
y = data['toxic']

# 80/20 split into training data and a temporary holdout
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, train_size=0.8, test_size=0.2, shuffle=False, random_state=12345)

# split the holdout in half: validation and test sets
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, shuffle=False, random_state=12345)

The inspector wrote to me: "You use both validation sampling and cross-validation at the same time. It would be better to transfer the entire project to cross-validation and increase the amount of data in training."

How to fix it?

I don't know how to approach this.


Solution

  • When using a validation dataset, we usually train a model on the training data and evaluate its performance on the validation data.

    Cross-validation is essentially the same thing, but done multiple times, with different splits.

    As your inspector suggests, there is no need to split off validation data yourself, as that is already done internally during cross-validation.

    It is hard to say exactly how to fix it without seeing how you use the validation data in the rest of the code. From what I can see, the first step is to get rid of the validation split entirely, so the code would look like:

    # one 90/10 train/test split; cross-validation will take care of validation
    X = data['text_with_tokeniz_lemmatiz']
    y = data['toxic']
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, test_size=0.1, shuffle=False, random_state=12345)
    

    If later in the code you use the validation data to measure the performance of your learning algorithm, you can replace that with cross-validation, for instance using scikit-learn's cross_val_score, as in the sketch below.
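
    For illustration, here is a minimal sketch of how that could look. Since the rest of your project is not shown, the TfidfVectorizer + LogisticRegression pipeline and the f1 scoring metric are assumptions; substitute whatever model and metric you actually use.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # hypothetical model: TF-IDF features fed into logistic regression
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

    # 5-fold cross-validation on the training data only: each fold takes a
    # turn as the validation set, so no separate validation split is needed
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print(scores.mean())

    X_test and y_test stay untouched until the very end, when you evaluate the final model on them once.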