tensorflow machine-learning keras tensorflow2.0 bounding-box

Keras data format for images and bounding boxes

I'm planning to train this neural network on my custom dataset. As far as I understand, I need the images and the bounding box coordinates. How can I load them if I have the images as .jpg or .png files and can get the bounding boxes from labelImg or a similar tool? How do I feed them to the model? The tutorial uses tfds.load for this. Could anyone explain, or give some insight into, how to prepare the data?

Solution

In general, when you work with raw data you need to write some helper functions that get the data into the desired format. You can use Keras's built-in utility functions to load an image and store it in a NumPy array, which you can then pass to the model.

import tensorflow as tf

def load_img(image_path):
    img = tf.keras.utils.load_img(image_path)  # loads the image as a PIL image
    img = tf.keras.utils.img_to_array(img)  # converts it to a NumPy array of shape (height, width, channels)
    return img

You can use the function above to load images, and you can add normalization or resizing to it according to your needs.
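As a rough sketch of what such an extension could look like, the variant below resizes while loading and scales pixel values to the range [0, 1]. The function name load_and_preprocess and the default target size are assumptions made for illustration, not something the tutorial prescribes.

```python
import tensorflow as tf

def load_and_preprocess(image_path, target_size=(224, 224)):
    """Load an image, resize it to target_size (height, width), scale to [0, 1]."""
    # target_size is (height, width); the aspect ratio is not preserved by default.
    img = tf.keras.utils.load_img(image_path, target_size=target_size)
    arr = tf.keras.utils.img_to_array(img)  # float32 array, values in [0, 255]
    return arr / 255.0                      # simple normalization to [0, 1]
```

Keep in mind that resizing an image changes its pixel grid, so any bounding boxes given in pixels have to be scaled by the same factors.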

As for the bounding box, many models take an array of size 4 as the bounding box input, of the form [xmin, ymin, xmax, ymax], where xmin and ymin are the upper-left corner of the box and xmax and ymax are the lower-right corner. You have to extract this information from the XML files that labelImg produces.
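To make the [xmin, ymin, xmax, ymax] convention concrete, here is a small sketch of two conversions that often come up: expressing a pixel box as fractions of the image size, and rescaling a box when the image is resized. The helper names normalize_box and rescale_box are made up for this example, and whether your model expects pixel or relative coordinates depends on the model.

```python
def normalize_box(box, image_width, image_height):
    """Convert a pixel box [xmin, ymin, xmax, ymax] to fractions of the image size."""
    xmin, ymin, xmax, ymax = box
    return [xmin / image_width, ymin / image_height,
            xmax / image_width, ymax / image_height]

def rescale_box(box, old_size, new_size):
    """Scale a pixel box when the image goes from old_size to new_size, both (width, height)."""
    old_w, old_h = old_size
    new_w, new_h = new_size
    sx, sy = new_w / old_w, new_h / old_h  # horizontal and vertical scale factors
    xmin, ymin, xmax, ymax = box
    return [xmin * sx, ymin * sy, xmax * sx, ymax * sy]

box = [50, 100, 150, 300]  # box on a 200x400 (width x height) image
print(normalize_box(box, 200, 400))              # [0.25, 0.25, 0.75, 0.75]
print(rescale_box(box, (200, 400), (100, 100)))  # [25.0, 25.0, 75.0, 75.0]
```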

You can read this XML data like this:

import xml.etree.ElementTree as et

import numpy as np

# path_to_xml is the path to one annotation file written by labelImg
root = et.parse(path_to_xml).getroot()  # get the root of the XML
boxes = list()
for box in root.findall('.//object'):  # one <object> element per labelled box
    label = box.find('name').text  # class name of the box
    xmin = int(box.find('./bndbox/xmin').text)
    ymin = int(box.find('./bndbox/ymin').text)
    xmax = int(box.find('./bndbox/xmax').text)
    ymax = int(box.find('./bndbox/ymax').text)
    data = np.array([xmin, ymin, xmax, ymax])
    boxes.append(data)

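This is one example of how you could read the XML file. The code above works for XML in the Pascal VOC layout that labelImg writes by default, with one object element per box containing a name and a bndbox.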
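For reference, here is a self-contained sketch of the same parsing applied to an in-memory annotation. The sample XML is made up for illustration (file name, labels and coordinates are not from a real dataset), and the function name parse_annotation is an assumption; it only mirrors the element names the parsing code relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the layout the parsing code expects.
SAMPLE_XML = """<annotation>
    <filename>dog.jpg</filename>
    <size><width>640</width><height>480</height><depth>3</depth></size>
    <object>
        <name>dog</name>
        <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
    </object>
    <object>
        <name>person</name>
        <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>460</ymax></bndbox>
    </object>
</annotation>"""

def parse_annotation(xml_text):
    """Return (labels, boxes) where each box is [xmin, ymin, xmax, ymax] in pixels."""
    root = ET.fromstring(xml_text)
    labels, boxes = [], []
    for obj in root.findall(".//object"):
        labels.append(obj.find("name").text)
        bndbox = obj.find("bndbox")
        boxes.append([int(bndbox.find(tag).text)
                      for tag in ("xmin", "ymin", "xmax", "ymax")])
    return labels, boxes

labels, boxes = parse_annotation(SAMPLE_XML)
print(labels)  # ['dog', 'person']
print(boxes)   # [[48, 240, 195, 371], [8, 12, 352, 460]]
```

For files on disk, ET.parse(path).getroot() gives the same root element as ET.fromstring does here.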

Following these steps, you have two NumPy arrays: one holding the images in numerical form and one holding the bounding boxes. You can now build a model with a CNN backbone and a regression head, feed it this data, and train it.
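A minimal sketch of such a model might look like the following. It assumes exactly one box per image, boxes normalized to [0, 1], and a fixed 128x128 input; the layer sizes, the name build_box_regressor and the random placeholder data are illustrative choices only, not the architecture from the tutorial.

```python
import numpy as np
import tensorflow as tf

def build_box_regressor(input_shape=(128, 128, 3)):
    """Tiny CNN with a 4-unit regression head predicting [xmin, ymin, xmax, ymax]."""
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.layers.Rescaling(1.0 / 255)(inputs)  # pixel values to [0, 1]
    x = tf.keras.layers.Conv2D(16, 3, activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    # Sigmoid keeps outputs in [0, 1], matching normalized box coordinates.
    outputs = tf.keras.layers.Dense(4, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Placeholder data standing in for the real image and box arrays.
rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(8, 128, 128, 3)).astype("float32")
mins = rng.uniform(0.0, 0.5, size=(8, 2))    # (xmin, ymin)
sizes = rng.uniform(0.1, 0.5, size=(8, 2))   # box width and height
boxes = np.concatenate([mins, mins + sizes], axis=1).astype("float32")

model = build_box_regressor()
model.fit(images, boxes, epochs=1, batch_size=4, verbose=0)
pred = model.predict(images[:1], verbose=0)  # one row: [xmin, ymin, xmax, ymax]
```

Images with several objects need a different setup, such as a dedicated object detection architecture, because a single 4-unit head can only describe one box.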