machine-learning computer-vision conv-neural-network face-recognition pytorch

Using Augmented Data Images in Testing

I am working on a person re-identification problem and report the results using a CMC curve. My training set (currently CUHK01) contains augmented images alongside the original ones. At test time, if I compute the ranks on the original test images only, I get a Rank-1 of ~30%. If I also include augmented images in the test set, Rank-1 rises to ~65-70%, which is suspiciously high compared with currently reported Rank-1 accuracies.

So my questions are:

a) How does augmented data affect the test set, especially in my case?

b) Am I overfitting, or something of that sort?

c) Is it a general rule to avoid using augmented images at test time?

Solution

Data augmentation is used to reduce the chance of overfitting. The idea is to teach the model that its parameters (theta) should not depend on the property you vary through augmentation (alpha). In principle, that is achieved by augmenting every input with every possible alpha. In practice this is out of reach for several reasons, e.g. time and memory limits, or not being able to construct every possible augmentation, so some bias may remain. Augmentation still reduces the chance of overfitting to your dataset, but the model may instead overfit to your augmentation.
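To make the train/test distinction concrete, here is a minimal sketch in plain Python, assuming an image is just a list of pixel rows and that the only augmentation is a random horizontal flip. The names `horizontal_flip` and `preprocess` are hypothetical and not taken from the question's code. In a PyTorch project the same split is usually expressed as two separate transform pipelines, one for training and one for evaluation.

```python
import random

def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def preprocess(image, train, rng=random):
    """Apply a random flip in training mode only; test images stay untouched."""
    if train and rng.random() < 0.5:
        return horizontal_flip(image)
    # Return a copy so callers never modify the original image by accident.
    return [list(row) for row in image]

# Toy 2x3 "image" used purely for illustration.
image = [[1, 2, 3],
         [4, 5, 6]]

rng = random.Random(0)  # seeded so the training view is reproducible
train_view = preprocess(image, train=True, rng=rng)  # may or may not be flipped
test_view = preprocess(image, train=False)           # always the original pixels
```

The point of the `train` flag is that the augmentation lives in one place and is switched off for evaluation, so the reported ranks are measured on unmodified test images.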

Thus, if you evaluate with augmented images, you may get higher accuracy because the model matches the augmented data it has overfit to. That answers question a, and so I believe the answer to question b is yes.
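One way to see how the evaluation set drives the number is to compute the CMC curve yourself. The sketch below is an illustration in plain Python, not the question's actual evaluation code: `cmc` is a hypothetical helper, the distances are made-up toy values, and the fourth gallery column stands in for an extra (for example augmented) image of identity 2.

```python
def cmc(dist, probe_ids, gallery_ids, max_rank=3):
    """Return CMC values: entry k-1 is the fraction of probes whose
    correct identity appears among the k nearest gallery images."""
    hits = [0] * max_rank
    for row, pid in zip(dist, probe_ids):
        # Gallery indices sorted by increasing distance to this probe.
        order = sorted(range(len(gallery_ids)), key=lambda j: row[j])
        ranked_ids = [gallery_ids[j] for j in order]
        if pid not in ranked_ids:
            continue  # identity missing from the gallery: never a hit
        first = ranked_ids.index(pid)  # 0-based position of first correct match
        for k in range(first, max_rank):
            hits[k] += 1
    return [h / len(probe_ids) for h in hits]

probe_ids = [1, 2, 3]

# Original gallery: one image per identity (toy distances).
gallery_ids = [1, 2, 3]
dist = [[0.2, 0.9, 0.8],
        [0.7, 0.6, 0.3],
        [0.5, 0.4, 0.1]]
curve_plain = cmc(dist, probe_ids, gallery_ids)

# Same gallery plus one extra image of identity 2 (toy distances).
gallery_ids_aug = [1, 2, 3, 2]
dist_aug = [[0.2, 0.9, 0.8, 0.95],
            [0.7, 0.6, 0.3, 0.1],
            [0.5, 0.4, 0.1, 0.6]]
curve_aug = cmc(dist_aug, probe_ids, gallery_ids_aug)
```

In this toy setup Rank-1 is 2/3 on the original gallery and 1.0 once the extra image is added, even though the model is unchanged. Running both evaluations side by side on the real data makes it clear how much of the gain comes from the augmented test images rather than from better matching.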

As for question c, I have not come across explicit rules for data augmentation, but as far as I can tell the machine learning literature avoids any augmentation on the test set. For example, to quote from a paper:

We augment the training images by replacing the green screen with random background images, and vary the appearance in terms of color and shading by intrinsic recoloring