python tensorflow keras conv-neural-network kaggle

Kaggle: Dealing with extra unlabelled test data in CNN

I'm doing a Kaggle competition and I have extra test data that I don't have labels for.

I have a train.txt file with the following format:

train/0.jpg 5
train/1.jpg 1
train/2.jpg 10
train/3.jpg 2
train/4.jpg 22
train/5.jpg 3
etc...

So, for example, image 0.jpg is of class 5.

This continues up to train/10259.jpg.
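
As a minimal sketch, a file in this format could be read into a mapping from file name to class as below. It assumes every non-blank line is a path, whitespace, and an integer label; the function name `parse_label_lines` is made up for illustration.

```python
import os


def parse_label_lines(lines):
    """Map each image file name (without its folder) to its integer class."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: "train/0.jpg 5" -> path, label
        path, label = line.rsplit(maxsplit=1)
        labels[os.path.basename(path)] = int(label)
    return labels


# Sample lines copied from the format shown above
sample = ["train/0.jpg 5", "train/1.jpg 1", "train/2.jpg 10"]
labels = parse_label_lines(sample)
print(labels["0.jpg"])  # 5
```

For the real file, the same function can be fed an open file handle, e.g. `with open("train.txt") as f: labels = parse_label_lines(f)`.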

I then add these labels to the file names of my train data (and later my test data), so they become:

0.jpg -> 5.0.jpg
2.jpg -> 10.2.jpg
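
The renaming scheme above can be sketched as a pair of helpers, assuming the new name is simply the label, a dot, and the original file name. Both function names are hypothetical.

```python
def labelled_name(file_name, label):
    """Prefix a file name with its class label: 0.jpg, 5 -> 5.0.jpg"""
    return f"{label}.{file_name}"


def split_labelled_name(name):
    """Recover (label, original file name) from a prefixed name."""
    label, original = name.split(".", 1)  # split at the first dot only
    return int(label), original


print(labelled_name("0.jpg", 5))        # 5.0.jpg
print(split_labelled_name("10.2.jpg"))  # (10, '2.jpg')
```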

10259 is the size of my train dataset. Therefore, I have all the labels for the training set.

I then do the same with the /test folder. However, I have more test images than training images, so there are some test images that I don't have labels for.

I'm using Keras ImageDataGenerator() and I've sorted my training images into one folder per class.

My test dataset is laid out in a similar way, but because I don't have labels for some of the test data, there are images that haven't been put into a class folder.
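
A rough sketch of this sorting step is shown below. It copies every labelled image into a subfolder named after its class, which is the layout that `flow_from_directory` on an ImageDataGenerator reads, and returns the names that have no label. The function name, the folder names, and the flat source folder are all assumptions made for the example.

```python
import os
import shutil
import tempfile


def sort_into_class_folders(src_dir, dst_dir, labels):
    """Copy each labelled image to dst_dir/<label>/ and return the names skipped."""
    unlabelled = []
    for name in sorted(os.listdir(src_dir)):
        if name not in labels:
            unlabelled.append(name)  # no label known, leave it where it is
            continue
        class_dir = os.path.join(dst_dir, str(labels[name]))
        os.makedirs(class_dir, exist_ok=True)
        shutil.copy2(os.path.join(src_dir, name), os.path.join(class_dir, name))
    return unlabelled


# Demo on empty placeholder files in a temporary folder (hypothetical names)
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "test")
    dst = os.path.join(root, "test_sorted")
    os.makedirs(src)
    for name in ["0.jpg", "1.jpg", "2.jpg"]:
        open(os.path.join(src, name), "wb").close()
    skipped = sort_into_class_folders(src, dst, {"0.jpg": 5, "1.jpg": 1})
    created = sorted(os.listdir(dst))

print(skipped)  # ['2.jpg']
print(created)  # ['1', '5']
```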

I'm unsure what to do with this unlabelled test data. Will it be fine leaving them as they are? Or should I split them into another set?

Solution

Unless you label the data yourself (by hand) or have another (superior) model at hand, there is not much you can do with the unlabeled "test" data.

The idea of test data is to compare the predicted results with the true labels. Images without labels can't be scored, so if you don't have the labels, you should discard those images from the test set.
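
To illustrate, here is a small sketch that scores predictions only on the images that have a true label and counts the rest as discarded. The function name and the prediction values are invented for the example; in practice the predictions would come from your model.

```python
def accuracy_on_labelled(predictions, true_labels):
    """Return (accuracy over labelled images, number of images discarded)."""
    scored = [name for name in predictions if name in true_labels]
    if not scored:
        raise ValueError("no labelled images to score")
    correct = sum(predictions[name] == true_labels[name] for name in scored)
    return correct / len(scored), len(predictions) - len(scored)


# Hypothetical predictions for four test images; only three have true labels
predictions = {"0.jpg": 5, "1.jpg": 3, "2.jpg": 10, "3.jpg": 7}
true_labels = {"0.jpg": 5, "1.jpg": 1, "2.jpg": 10}

accuracy, n_discarded = accuracy_on_labelled(predictions, true_labels)
# 2 of the 3 labelled images are predicted correctly; 3.jpg is discarded
print(accuracy, n_discarded)
```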