Tags: machine-learning, artificial-intelligence, cross-validation, supervised-learning

When to use k-fold cross validation and when to use split percentage?


Which kind of dataset benefits the most from using k-fold validation? Is it usually a better option than standard split percentage?


Solution

  • The short answer is: small ones.

    Longer version - you use k-fold splits (or bootstraps etc.) when a single random sample of the data is not a representative sample of the underlying distribution. The size of the dataset is just a heuristic that tries to capture this phenomenon. The problem is that the more complex your distribution, the bigger "big enough" is. Thus, if your problem is a 2D classification where a linear model fits nearly perfectly, you can probably use a single random split even when you have just a few hundred points. On the other hand, if your data comes from an extremely complex distribution that violates the iid assumption etc., you will need many splits to recover reliable statistics.

    So how do you decide? In general, do k-fold CV if you can afford it (in terms of computational time); you will not harm your process this way. Another, more statistically sound approach is to gather statistics of your data - for example, KDEs of the marginal distributions (projections onto each feature) of each split and of the whole dataset - and compare them: if they are nearly the same, you are good to go with such a split. If you notice (either visually or through statistical tests) that these distributions differ significantly, then you have to add k-fold CV (or another technique that reduces the variance of the results).
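The fold mechanics described above can be sketched in plain Python (a minimal illustration, not tied to any particular library): each of the k folds serves as the held-out set exactly once, so every point contributes to the error estimate, which is what reduces the variance compared to a single random split.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold CV over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold_size, rem = divmod(n, k)
    start = 0
    for i in range(k):
        # Early folds absorb the remainder so sizes differ by at most 1.
        stop = start + fold_size + (1 if i < rem else 0)
        test = idx[start:stop]
        train = idx[:start] + idx[stop:]
        yield train, test
        start = stop

# Every sample appears in exactly one test fold.
folds = list(k_fold_indices(10, 3))
all_test = sorted(i for _, test in folds for i in test)
print(all_test)  # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In practice you would train a model on each `train` set, score it on the matching `test` set, and average the k scores; that average is the cross-validated estimate.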
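The distribution-comparison idea can also be made quantitative. As one hedged sketch (using a two-sample Kolmogorov-Smirnov statistic in place of a visual KDE comparison; the helper name is hypothetical), compare the values of one feature in your split against the same feature over the whole dataset: a statistic near 0 means the split's marginal is close to the full distribution, a value near 1 means they barely overlap.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: max gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    values = sorted(set(a) | set(b))
    d = 0.0
    ia = ib = 0
    for v in values:
        # Advance each pointer past all elements <= v; the pointer position
        # divided by the sample size is the empirical CDF at v.
        while ia < len(a) and a[ia] <= v:
            ia += 1
        while ib < len(b) and b[ib] <= v:
            ib += 1
        d = max(d, abs(ia / len(a) - ib / len(b)))
    return d

# Identical samples -> no gap; disjoint samples -> maximal gap.
print(ks_statistic([1, 2, 3], [1, 2, 3]))  # -> 0.0
print(ks_statistic([0, 0], [1, 1]))        # -> 1.0
```

You would compute this per feature (each marginal) for a candidate split versus the full dataset; if any feature shows a large statistic, the split is not representative and k-fold CV is the safer choice.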