machine-learning cross-validation train-test-split

Order between using validation, training and test sets

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used.

Let's say I have a dataset and I want to use linear regression. I am hesitating among various polynomial degrees (hyper-parameters).

In this wikipedia article, it seems to imply that the sequence should be:

Split data into training set, validation set and test set
Use the training set to fit the model (find the best parameters: coefficients of the polynomial).
Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")
Finally, use the test set to score the model fitted with the training set.

However, this seems strange to me: how can you fit your model with the training set if you haven't chosen yet your hyper-parameters (polynomial degree in this case)?

I see three alternative approachs, I am not sure if they would be correct.

First approach

Split data into training set, validation set and test set
For each polynomial degree, fit the model with the training set and give it a score using the validation set.
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set

Second approach

Split data into training set, validation set and test set
For each polynomial degree, use cross-validation only on the validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training set.
Evaluate with the test set

Third approach

Split data into only two sets: the training/validation set and the test set
For each polynomial degree, use cross-validation only on the training/validation set to fit and score the model
For the polynomial degree with the best score, fit the model with the training/validation set.
Evaluate with the test set

So the question is:

Is the wikipedia article wrong or am I missing something?
Are the three approaches I envisage correct? Which one would be preferrable? Would there be another approach better than these three?

Solution

What Wikipedia means is actually your first approach.

1 Split data into training set, validation set and test set

2 Use the training set to fit the model (find the best parameters: coefficients of the polynomial).

That just means that you use your training data to fit a model.

3 Afterwards, use the validation set to find the best hyper-parameters (in this case, polynomial degree) (wikipedia article says: "Successively, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset")

That means that you use your validation dataset to predict its values with the previously (on the training set) trained model to get a score of how good your model performs on unseen data.

You repeat step 2 and 3 for all hyperparameter combinations you want to look at (in your case the different polynomial degrees you want to try) to get a score (e.g. accuracy) for every hyperparmeter combination.

Finally, use the test set to score the model fitted with the training set.

Why you need the validation set is pretty well explained in this stackexchange question https://datascience.stackexchange.com/questions/18339/why-use-both-validation-set-and-test-set

In the end you can use any of your three aproaches.

approach:

is the fastest because you only train one model for every hyperparameter. also you don't need as much data as for the other two.
approach:

is slowest because you train for k folds k classifiers plus the final one with all your training data to validate it for every hyperparameter combination.

You also need a lot of data because you split your data three times and that first part again in k folds.

But here you have the least variance in your results. Its pretty unlikely to get k good classifiers and a good validation result by coincidence. That could happen more likely in the first approach. Cross Validation is also way more unlikely to overfit.
approach:

is in its pros and cons in between of the other two. Here you also have less likely overfitting.

In the end it will depend on how much data you have and if you get into more complex models like neural networks, how much time/calculationpower you have and are willing to spend.

Edit As @desertnaut mentioned: Keep in mind that you should use training- and validationset as training data for your evaluation with the test set. Also you confused training with validation set in your second approach.