python tensorflow machine-learning deep-learning object-detection

Using VGG16 for bounding box prediction on my own dataset

After building a VGG16-based classifier, I would like to predict a bounding box around the detected object.

I read online that I can do this by removing the layers after the last max-pooling layer and adding some fully connected layers:

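from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# Assumed setup (not shown in the post): frozen VGG16 base without its top layers
vgg16 = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
vgg16.trainable = False

flatten = Flatten()(vgg16.output)
bboxhead = Dense(128, activation="relu")(flatten)
bboxhead = Dense(64, activation="relu")(bboxhead)
bboxhead = Dense(32, activation="relu")(bboxhead)
bboxhead = Dense(4, activation="relu")(bboxhead)  # 4 outputs: one per box coordinate
box_model = Model(inputs=vgg16.input, outputs=bboxhead, name="box_model")
box_model.summary()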

The model summary looks like this, which matches what I found online:

Model: "box_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 224, 224, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 112, 112, 64)      0         
                                                                 
 block2_conv1 (Conv2D)       (None, 112, 112, 128)     73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 112, 112, 128)     147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 56, 56, 128)       0         
                                                                 
 block3_conv1 (Conv2D)       (None, 56, 56, 256)       295168    
                                                                 
 block3_conv2 (Conv2D)       (None, 56, 56, 256)       590080    
                                                                 
 block3_conv3 (Conv2D)       (None, 56, 56, 256)       590080    
                                                                 
 block3_pool (MaxPooling2D)  (None, 28, 28, 256)       0         
                                                                 
 block4_conv1 (Conv2D)       (None, 28, 28, 512)       1180160   
                                                                 
 block4_conv2 (Conv2D)       (None, 28, 28, 512)       2359808   
                                                                 
 block4_conv3 (Conv2D)       (None, 28, 28, 512)       2359808   
                                                                 
 block4_pool (MaxPooling2D)  (None, 14, 14, 512)       0         
                                                                 
 block5_conv1 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv2 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, 7, 7, 512)         0         
                                                                 
 flatten (Flatten)           (None, 25088)             0         
                                                                 
 dense (Dense)               (None, 128)               3211392   
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 32)                2080      
                                                                 
 dense_3 (Dense)             (None, 4)                 132       
                                                                 
=================================================================
Total params: 17,936,548
Trainable params: 3,221,860
Non-trainable params: 14,714,688
_________________________________________________________________

Then I train the model:

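import math
from tensorflow.keras.optimizers import Adam

opt = Adam(1e-4)
box_model.compile(loss='mse', optimizer=opt)

batch_size = 32  # assumed to match the batch size train_gen and val_gen were built with
num_epochs = 30
# Whole number of batches per epoch (a float step count is not valid)
steps = math.ceil(train_gen.n / batch_size)
val_steps = math.ceil(val_gen.n / batch_size)

# batch_size is not passed to fit(): the generators already yield batches
history = box_model.fit(train_gen, validation_data=val_gen, steps_per_epoch=steps,
                        validation_steps=val_steps, epochs=num_epochs, verbose=1)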

However, the last Dense layer has 4 units, which does not match my number of classes (5). After I changed it to 5 units, the code runs, but the model does not learn anything: the 5-value output array is all zeros, which is not reasonable.

Or is my implementation incorrect?

Solution

In short: your implementation is fine, but your labels are wrong.

To train a new output, you need new labels. The input need not change, but you need to obtain the x, y, height and width of the bounding box you are trying to predict for each image. If the dataset does not provide these, you will need to annotate the images yourself.
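As a rough illustration of what such a label could look like, the sketch below turns one pixel-space annotation into a 4-value target. It assumes the annotation gives the top-left corner plus width and height in pixels, and it scales everything to [0, 1] by the image size, which is a common convention rather than a requirement. The function name `to_normalized_box` and the example numbers are hypothetical.

```python
def to_normalized_box(x_min, y_min, box_w, box_h, img_w, img_h):
    """Convert a pixel-space box into a 4-value label scaled to [0, 1].

    Assumes (x_min, y_min) is the top-left corner in pixels and that
    box_w / box_h are the box width and height in pixels.
    """
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    return [x_min / img_w, y_min / img_h, box_w / img_w, box_h / img_h]


# Hypothetical annotation for one 224x224 image
label = to_normalized_box(56, 28, 112, 84, img_w=224, img_h=224)
print(label)  # [0.25, 0.125, 0.5, 0.375]
```

Each training image would then be paired with one such 4-value list in place of its class label, so the target has the same shape as the `Dense(4)` output.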

If you want to train on bounding box coordinates, your label needs to be bounding box coordinates. You can't keep training with the class labels of your dataset. Whatever your model is trying to learn in supervised learning, that is what you need to supply as a label.
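One way to catch this mismatch before training is a small sanity check on the labels. The sketch below is only an assumption about how you might do it: `validate_box_labels` is a hypothetical helper, and the [0, 1] range check assumes the boxes have been scaled by the image size.

```python
def validate_box_labels(labels, n_outputs=4):
    """Raise ValueError unless every label is n_outputs numbers in [0, 1]."""
    for i, label in enumerate(labels):
        if len(label) != n_outputs:
            raise ValueError(
                f"label {i} has {len(label)} values, expected {n_outputs}"
            )
        if not all(0.0 <= v <= 1.0 for v in label):
            raise ValueError(f"label {i} has values outside [0, 1]")
    return True


# Box labels (x, y, width, height) pass the check
validate_box_labels([[0.25, 0.125, 0.5, 0.375]])

# A one-hot class label for 5 classes does not: it has 5 values, not 4
try:
    validate_box_labels([[0, 0, 1, 0, 0]])
except ValueError as err:
    print(err)  # label 0 has 5 values, expected 4
```

If the check fails with 5 values per label, the generator is still yielding class labels, and widening the last Dense layer to 5 units only hides the problem.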