Tags: machine-learning, classification, data-modeling, predict

Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)? (no time-series!)


TLDR:

This question is not about a classical ML time-series analysis; it treats monthly columns as mere features, just like any timeless feature. I am sharing this question since I had exactly this challenge at work, and in the end, the model worked fine with this setup, mixing non-monthly (timeless) features with monthly features. The model does not care whether it is December or June; it only cares how many months back each feature lies, so that it learns from the pattern of the data over some x months in the past. The features are therefore not named after the calendar month, but after how many months they lie back in time, e.g. wealth_month_1 and wealth_month_2 for the wealth of 1 or 2 months back.


I know the general rule that we should test a trained classifier only on the testing set.

But now comes the question: When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set? Or do I have to apply it to a new predicting set that is different from the training+testing set?

And what if I predict the label column of a time series? (Edited later: I do not mean a classical time-series analysis here, but a broad selection of columns from a typical database - weekly, monthly or randomly stored data that I convert into separate feature columns, one per week / month / year ...) Do I have to shift all of the features of the training+testing set back to a point in time where the data has no "knowledge" overlap with the predicting set - not just the past columns of the time-series label column, but also all other normal features?

I would then train and test the classifier on features shifted n months into the past, scoring against an unshifted, most recent label column, and then predict from the most recent, unshifted features. Shifted and unshifted features have the same number of columns; I align them by assigning the column names of the shifted features to the unshifted features, as sketched below.
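
To make the alignment concrete, here is a minimal pandas sketch of what I mean (the DataFrame df, all column names and n = 2 are hypothetical, purely for illustration):

    import pandas as pd

    # Hypothetical wide table: one row per person, monthly snapshot
    # columns ordered oldest to newest, plus the current label column.
    df = pd.DataFrame({
        "wealth_2023_01": [100, 200],
        "wealth_2023_02": [110, 190],
        "wealth_2023_03": [120, 210],
        "wealth_2023_04": [130, 220],
        "class_current":  [1, 0],
    })

    n = 2  # how many months ahead the prediction should reach

    monthly_cols = ["wealth_2023_01", "wealth_2023_02",
                    "wealth_2023_03", "wealth_2023_04"]

    train_cols   = monthly_cols[:-n]  # window shifted n months into the past
    predict_cols = monthly_cols[n:]   # same-sized window ending at the newest month

    X_train   = df[train_cols].copy()
    X_predict = df[predict_cols].copy()
    y_train   = df["class_current"]   # unshifted, most recent label

    # Align by "months back in time", not by calendar month: the newest
    # column of each window becomes wealth_month_1, the next older one
    # wealth_month_2, and so on.
    k = len(train_cols)
    X_train.columns   = [f"wealth_month_{k - i}" for i in range(k)]
    X_predict.columns = [f"wealth_month_{k - i}" for i in range(k)]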

p.s.:

p.s.1: The general approach, quoted from https://en.wikipedia.org/wiki/Dependent_and_independent_variables:

In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable.[8] Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data.

p.s.2: In this basic tutorial we can see that the predicting set is made different: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data: […] Now you can predict new values. In this case, you’ll predict using the last image from digits.data [-1:]. By predicting, you’ll determine the image from the training set that best matches the last image.
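
For reference, that tutorial's pattern looks like this (code taken from the linked scikit-learn basic tutorial):

    from sklearn import datasets, svm

    digits = datasets.load_digits()
    clf = svm.SVC(gamma=0.001, C=100.)

    # Train on all but the last image ...
    clf.fit(digits.data[:-1], digits.target[:-1])

    # ... and predict on the one image the classifier has never seen
    print(clf.predict(digits.data[-1:]))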


Solution

  • The question above, "When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set?", has the simple answer: No.

    The question above, "Do I have to shift all of the features?", has the simple answer: Yes.

    In short, if I predict a month's class column, I have to shift all of the non-class columns back in time as well, in addition to the previous class months I converted to features: all data must have been known before the month in which the class is predicted.

    This also means: the predicting set has to be different from the dataset that contains the testing set. If you kept a testing set in this final run, the training set would lose the valuable, up-to-date data of the latest month(s) available. The final "predicting set" is thus meant as the "most current input, used without a testing set", to get the "most current results" for the prediction. A minimal sketch of this final step follows below.
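
    A rough sketch of that final fit-and-predict step, reusing the hypothetical X_train / y_train / X_predict frames from the sketch in the question above (the classifier choice is arbitrary):

        from sklearn.ensemble import RandomForestClassifier

        clf = RandomForestClassifier(random_state=0)

        # Fit on the window shifted n months back, scored against the
        # most recent, unshifted labels ...
        clf.fit(X_train, y_train)

        # ... then predict from the unshifted, newest window: these are
        # the labels expected n months after the last known month.
        y_future = clf.predict(X_predict)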

    This is confirmed by the following overview offered by this user, who seems to have made the image. It uses days instead of months, but the idea is the same:

    [Image: diagram of train/test/validation splits over time; the last row trains on the whole dataset]

    Source: an answer on "Cross Validated" - "Splitting Time Series Data into Train/Test/Validation Sets"; the whole Q/A is recommended (!).

    See the last line of the image and the valuable comments on that "Cross Validated" answer to understand this. Mind that this question is not about a classical ML time series: it is just about taking a set of historical columns that reach up to the present - saved monthly, weekly, yearly or at whatever interval - and turning them into features like any other timeless columns that get overwritten or do not change anyway. The picture still explains this equally well.

    Edit 230106:

    The image shows that the last step is a training on the whole dataset; this full, newest dataset is the "predicting set", and it does not have a testing set.

    On that image, there is one "mistake", which shows how hard this seemingly easy question of taking former labels as features for upcoming labels is to understand. I did not see it myself at first and posted the image without this remark: the "T&V" lies in the past of the "Test". That would be a wrong validation for a model that shall predict the future; the "V" must lie in the "future" test block (unless you have a dataset that does not change dynamically over time, as in physics).

    You would have to change it to a "walk-forward" model, with the validation set - if used at all - split k-fold from the testing set, not from the training set. That would look like this:

    [Image: the same diagram changed to a walk-forward split, with the validation taken from the test block]
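
    As an aside, scikit-learn's TimeSeriesSplit implements this walk-forward idea; a minimal, generic sketch (the toy arrays are made up):

        import numpy as np
        from sklearn.model_selection import TimeSeriesSplit

        # Toy data: 8 "months" of one feature each (made-up numbers)
        X = np.arange(8).reshape(-1, 1)
        y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

        # Each split trains on a growing window of past months and tests
        # on the months immediately after it - never on earlier months.
        for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
            print("train:", train_idx, "test:", test_idx)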

    See also: