Tags: machine-learning, statistics, data-science, cross-validation

Disadvantages of train-test split


"Train/test split does have its dangers — what if the split we make isn’t random? What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women or only people at a certain age? (imagine a file ordered by one of these). This will result in overfitting, even though we’re trying to avoid it! This is where cross validation comes in."

The above is what most of the blogs I have read say, and I don't understand it. I think the disadvantage is not overfitting but underfitting. Suppose we split the data so that states A and B become the training set, and we then try to predict state C, which is completely different from the training data; that should lead to underfitting. Can someone fill me in on why most blogs state that train/test split leads to overfitting?
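To make the ordered-file scenario concrete, here is a minimal sketch (pure Python, with hypothetical state labels and a hand-rolled `split` helper, not any library's API) showing how a non-shuffled split on a file sorted by state leaves one state out of training entirely:

```python
import random

# Records sorted by state, as in a file ordered by that column.
data = [("A", i) for i in range(100)] + \
       [("B", i) for i in range(100)] + \
       [("C", i) for i in range(100)]

def split(records, train_frac=2/3, shuffle=False, seed=0):
    """Hypothetical train/test split helper for illustration."""
    records = list(records)
    if shuffle:
        random.Random(seed).shuffle(records)
    cut = int(len(records) * train_frac)
    return records[:cut], records[cut:]

# Ordered split: state "C" never appears in the training data,
# and the test set contains nothing but state "C".
train, test = split(data, shuffle=False)
print(sorted({s for s, _ in train}))  # ['A', 'B']
print(sorted({s for s, _ in test}))   # ['C']

# Shuffled split: every state is represented in the training data.
train, test = split(data, shuffle=True)
print(sorted({s for s, _ in train}))  # ['A', 'B', 'C']
```

With the ordered split, the model is fit on states A and B only and then evaluated on a state it has never seen.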


Solution

  • It would be more correct to talk about selection bias, which your question describes.

    Selection bias is not really tied to overfitting but to fitting a biased sample; as a result, the model will be unable to generalize and predict correctly.

    In other words, whether you call it "fitting" or "overfitting", training on a biased set is still wrong.

    The appeal of the "over" prefix is just that: it implies a bias toward the training data.

    Imagine you have no selection bias. In that case, when you overfit even a healthy set, by definition of overfitting, you will still make the model biased towards your train set.

    Here, your starting training set is already biased, so any fitting, even "correct" fitting, will produce a biased model, just as overfitting does.
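This is why the blog the question quotes recommends cross-validation: instead of betting the evaluation on one possibly biased split, every record gets a turn in the test set. Below is a minimal k-fold indexing sketch (pure Python, with a hypothetical `k_fold_indices` helper, not any library's API):

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) index pairs for k folds over n records."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffle first to break any file ordering
    fold_size = n // k
    for f in range(k):
        test = idx[f * fold_size:(f + 1) * fold_size]
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        yield train, test

# Every index lands in exactly one test fold, so no single
# biased split decides the evaluation.
seen = []
for train, test in k_fold_indices(10, k=5):
    seen.extend(test)
print(sorted(seen))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Shuffling before folding is what defuses the ordered-file problem; averaging scores over the k folds then gives an estimate that does not hinge on one lucky or unlucky split.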