tensorflow annotations tensorflow-datasets object-detection-api tfrecord

Converting image annotation formats into TFRecords for the TensorFlow Object Detection API

I am looking for help with image annotation formats for the TensorFlow Object Detection API.

Background:

As far as I know, there are two common annotation formats for images: Pascal VOC and COCO. Each has its own specification; here are the main differences between them:

Pascal VOC:

Stores annotations in .xml files.
Bounding box format: [x-top-left, y-top-left, x-bottom-right, y-bottom-right].
Uses a separate .xml annotation file for each image in the dataset.

COCO:

Stores annotations in .json files.
Bounding box format: [x-top-left, y-top-left, width, height].
Uses one annotation file per split (training, testing and validation).
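The two bounding box conventions listed above carry the same information, so one can be derived from the other. A minimal sketch in Python, assuming boxes are plain lists of pixel values (the function names `coco_to_voc` and `voc_to_coco` are made up for illustration and do not come from any library):

```python
def coco_to_voc(box):
    """Convert [x-top-left, y-top-left, width, height]
    to [x-top-left, y-top-left, x-bottom-right, y-bottom-right]."""
    x, y, width, height = box
    return [x, y, x + width, y + height]


def voc_to_coco(box):
    """Convert [x-top-left, y-top-left, x-bottom-right, y-bottom-right]
    to [x-top-left, y-top-left, width, height]."""
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]


if __name__ == "__main__":
    coco_box = [10, 20, 30, 40]
    voc_box = coco_to_voc(coco_box)
    print(voc_box)               # [10, 20, 40, 60]
    print(voc_to_coco(voc_box))  # [10, 20, 30, 40]
```

Only the last two values differ between the conventions, which is why a mixed dataset can be handled with a small arithmetic change rather than a different conversion pipeline.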

Current-issue:

I have two datasets to deal with, and this is how they are annotated.

Dataset-1:

File format: Pascal VOC (.xml)
Bounding box format: COCO
File creation: As in Pascal VOC (a separate .xml annotation file for each image in the dataset)

Dataset-2:

File format: Pascal VOC (.xml)
Bounding box format: COCO
File creation: As in COCO (one annotation file for each of training, testing and validation)
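Both layouts can be read with the same parsing code, since the only difference is how many images one .xml file describes. The sketch below is a rough illustration in Python using only the standard library. The XML schema in it (an `image` element with a `filename` attribute, and `object` elements with `class`, `X`, `Y`, `Width`, `Height` attributes) is entirely hypothetical and would need to be adapted to the tags your annotation files actually use.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical schema, for illustration only: one file describing several images
# (the Dataset-2 layout). A Dataset-1 file would have a single <image> as its root.
SAMPLE_XML = """
<annotations>
  <image filename="a.jpg" width="100" height="50">
    <object class="text" X="10" Y="5" Width="20" Height="10"/>
    <object class="text" X="40" Y="20" Width="30" Height="15"/>
  </image>
  <image filename="b.jpg" width="200" height="100">
    <object class="text" X="0" Y="0" Width="50" Height="25"/>
  </image>
</annotations>
"""


def boxes_by_image(xml_text):
    """Group COCO-style boxes (X, Y, Width, Height) by image filename."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    # Element.iter() also yields the root itself, so a per-image file whose
    # root is <image> is handled by the same loop.
    for image in root.iter("image"):
        name = image.get("filename")
        for obj in image.iter("object"):
            grouped[name].append({
                "class": obj.get("class"),
                "X": float(obj.get("X")),
                "Y": float(obj.get("Y")),
                "Width": float(obj.get("Width")),
                "Height": float(obj.get("Height")),
            })
    return dict(grouped)


if __name__ == "__main__":
    for filename, boxes in boxes_by_image(SAMPLE_XML).items():
        print(filename, len(boxes))
```

Once the boxes are grouped per image, the number of .xml files they came from no longer matters for the TFRecord conversion step.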

What I cannot figure out is which format (Pascal VOC or COCO) I should follow to convert my annotations into TFRecords (.xml to .records), since, as you can see, the annotations in my datasets do not belong purely to either format.

For instance, in this link the author wrote a script to convert .xml into .records, but it deals with the pure Pascal VOC format.

And in this link they deal with the pure COCO annotation format.

Which path should I follow as I am standing in the middle of both formats?

Solution

Which path should I follow as I am standing in the middle of both formats?

Use Pascal VOC format for conversion of .xml into .records.

Make the following changes in the create_tf_example function from this link:

for index, row in group.TextLine.iterrows():
    # Boxes are stored as X, Y, Width, Height, so derive the max corners
    # and normalize each coordinate by the image size.
    xmin.append(row['X'] / imgwidth)
    xmax.append((row['X'] + row['Width']) / imgwidth)
    ymin.append(row['Y'] / imgheight)
    ymax.append((row['Y'] + row['Height']) / imgheight)
    classes_text.append(row['class'].encode('utf8'))
    classes.append(class_text_to_int(row['class']))

This applies when your .xml annotations contain X, Y, Width, Height instead of xmin, ymin, xmax, ymax.
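The arithmetic in that loop can be checked in isolation before wiring it into the conversion script. Here is a small standalone sketch in Python; the helper name `normalize_box` is hypothetical, and the clamping to the [0, 1] range is an added assumption (a guard against boxes that slightly overrun the image), not something the original script does.

```python
def normalize_box(x, y, width, height, img_width, img_height):
    """Turn a pixel box given as X, Y, Width, Height into normalized
    (xmin, ymin, xmax, ymax) values relative to the image size."""
    xmin = x / img_width
    xmax = (x + width) / img_width
    ymin = y / img_height
    ymax = (y + height) / img_height

    # Assumption: clamp to [0, 1] so slightly out-of-bounds boxes stay valid.
    def clamp(value):
        return min(max(value, 0.0), 1.0)

    return tuple(clamp(v) for v in (xmin, ymin, xmax, ymax))


if __name__ == "__main__":
    # A 20x10 box at (10, 5) in a 100x50 image.
    print(normalize_box(10, 5, 20, 10, 100, 50))
```

For a valid box, every returned value lies between 0 and 1, with xmin less than xmax and ymin less than ymax.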