Just can't wrap my head around this one.
I understand that:
However, when we receive a new test set, we do have the predictor variables available. What is wrong with using these to normalise? Yes, the predictors contain information that relates to the target variable, but that is literally the definition of predicting with a model: we use the information in the predictors to produce predictions for the target. Why can't the model definition build in a step that normalises using the input data before predicting?
Surely the performance metrics wouldn't be skewed, as we are only using information from the predictors.
Yes, the test set is supposed to be 'unseen', but in reality, surely it is only the test set's target variable that is unseen, not its predictors.
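To make it concrete, here is a minimal sketch of what I have in mind, assuming a scikit-learn style workflow with StandardScaler and made-up data (the names and numbers are just for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)
X_test = rng.normal(loc=0.5, size=(40, 3))  # test predictors are available at prediction time

# Standard advice: fit the scaler on the training predictors only
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
preds_standard = model.predict(scaler.transform(X_test))

# What I'm asking about: fit the scaler on training + test predictors (targets never touched)
scaler_all = StandardScaler().fit(np.vstack([X_train, X_test]))
model_all = LogisticRegression().fit(scaler_all.transform(X_train), y_train)
preds_proposed = model_all.predict(scaler_all.transform(X_test))
```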
I have read around this, and the answers so far are vague, just repeating that the test set is unseen and that we would gain information about it. I would really appreciate an answer on why we can't use the predictors specifically, as I think the target case is obvious.
Thanks in advance!!
Having gone away and thought about my question (normalising our data on the test set as well), I realise this doesn't make much sense. Normalising is not part of the training but something we do before training, so normalising with the test set features is fine as an idea, but we would then have to train on this normalised data using the training set outcomes only. I originally thought "normalise on more data" beats "normalise on less data", but actually we would normalise on one set (training + test) and then fit on another (training). We would probably get a more poorly trained model as a result, and so I now believe it's a stupid idea!
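To convince myself, here is a tiny self-contained check (again scikit-learn, with made-up numbers) of the mismatch I mean: fitting the scaler on training + test rows shifts the scaling parameters, so the training features the model is actually fitted on are no longer zero-mean / unit-variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=0.0, size=(100, 2))
X_test = rng.normal(loc=2.0, size=(40, 2))   # test predictors shifted relative to training

scaler_train = StandardScaler().fit(X_train)
scaler_both = StandardScaler().fit(np.vstack([X_train, X_test]))

print(scaler_train.mean_, scaler_both.mean_)        # means differ once test rows are included
print(scaler_both.transform(X_train).mean(axis=0))  # training data is no longer centred at 0
```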