I am training a PyTorch RNN model and have multiple CSV files to train on and infer from. If I train on file #1 and infer on file #1, I get ~100% accurate predictions. If I train on file #1 and infer on, say, file #4 or file #2, accuracy drops to ~80%. Here's what I am doing (a minimal sketch of these steps follows the list):
1. Read the file and separate the features (X) and labels (y) into two dataframes.
2. The range of my values, both features and labels, is large, so I apply a scaling transformation.
3. Split the data into train and test sets.
4. Call model.train() and run the training data through the RNN.
5. Call model.eval() and get predictions from the model on the test data.
6. Inverse-transform (reverse-scale) the predictions.
7. Calculate the mean squared error.
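Since I cannot share the real code, here is a minimal, illustrative sketch of the steps above. The file name, the target column, the MinMaxScaler, and the one-layer RNN are all placeholders, not my actual setup:

```python
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder file/column names -- the real ones are confidential.
df = pd.read_csv("file1.csv")
X = df.drop(columns=["target"]).values
y = df[["target"]].values

# Fit one scaler per array and keep both for inverse-transforming later.
x_scaler = MinMaxScaler().fit(X)
y_scaler = MinMaxScaler().fit(y)
X_scaled, y_scaled = x_scaler.transform(X), y_scaler.transform(y)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_scaled, test_size=0.2, shuffle=False
)

class SimpleRNN(nn.Module):
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        self.rnn = nn.RNN(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, seq_len, n_features)
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])     # predict from the last time step

model = SimpleRNN(n_features=X.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Treat each row as a length-1 sequence just to keep the sketch short.
X_train_t = torch.tensor(X_train, dtype=torch.float32).unsqueeze(1)
y_train_t = torch.tensor(y_train, dtype=torch.float32)

model.train()
for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()

# Predict on the held-out split, undo the scaling, then compute MSE.
model.eval()
with torch.no_grad():
    X_test_t = torch.tensor(X_test, dtype=torch.float32).unsqueeze(1)
    preds_scaled = model(X_test_t).numpy()
preds = y_scaler.inverse_transform(preds_scaled)
y_true = y_scaler.inverse_transform(y_test)
mse = ((preds - y_true) ** 2).mean()
```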
So far this is all good: my MSE is very low.
After training, I need to run inference on a randomly selected file. Here's what I am doing for inference (again sketched after the list):
1. Read the single file and separate the features (X) and labels (y) into two dataframes.
2. Apply the scaling transformation.
3. Call model.eval().
4. Get the predictions.
5. Inverse-transform (reverse-scale) the predictions.
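Again purely as an illustration, the inference steps look roughly like this, continuing from the training sketch above (the file name is a placeholder, and model, x_scaler, and y_scaler are the objects created there):

```python
import pandas as pd
import torch

# model, x_scaler and y_scaler come from the training sketch above.
df_new = pd.read_csv("file4.csv")                  # placeholder name
X_new = df_new.drop(columns=["target"]).values
y_new = df_new[["target"]].values

# Scale with the scaler fitted during training, predict, then reverse-scale.
X_new_scaled = x_scaler.transform(X_new)

model.eval()
with torch.no_grad():
    X_new_t = torch.tensor(X_new_scaled, dtype=torch.float32).unsqueeze(1)
    preds_scaled = model(X_new_t).numpy()

preds = y_scaler.inverse_transform(preds_scaled)
mse_new = ((preds - y_new) ** 2).mean()
```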
If the inference file is the same as the file I trained on, accuracy is close to 100%. If I use a different file for inference, why does the accuracy drop? Am I doing something wrong? Unfortunately, I cannot share the actual code due to confidentiality.
With the additional information provided in the comment, I would say this is most likely an overfitting problem rather than a mistake in the implementation.
Your model is learning the data distribution of file #1, which helps it predict file #1's test set but does not transfer to the test sets of the other files.
To solve this, my suggestion would be to sample a training set from all the available files, so that it more closely resembles the distribution found across the whole collection rather than that of a single file; a rough sketch of this is shown below.
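A rough sketch of what I mean, assuming hypothetical file paths and a placeholder target column; the idea is simply to pool and shuffle all files before splitting:

```python
import glob
import pandas as pd
from sklearn.model_selection import train_test_split

# Pool every available file so the training set reflects the overall
# distribution rather than that of a single file (paths are placeholders).
frames = [pd.read_csv(path) for path in glob.glob("data/file*.csv")]
combined = pd.concat(frames, ignore_index=True)

X = combined.drop(columns=["target"]).values
y = combined[["target"]].values

# Shuffle so every file contributes to both the train and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
```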
Looking into other RNN overfitting remedies might also be worthwhile; one illustrative example follows.
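For instance, two common levers are dropout between stacked recurrent layers and L2 weight decay in the optimizer; the hyperparameter values below are purely indicative, not tuned recommendations:

```python
import torch
import torch.nn as nn

class RegularizedRNN(nn.Module):
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        # dropout in nn.RNN is applied between layers, so it needs num_layers > 1
        self.rnn = nn.RNN(n_features, hidden_size, num_layers=2,
                          dropout=0.3, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])

model = RegularizedRNN(n_features=10)
# weight_decay adds L2 regularization on top of the Adam update.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```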