Tags: deep-learning, pytorch, object-detection, yolo

Better understanding of what is used to feed YOLO


I'm trying to construct a YOLO dataset. I read a lot about the dataset format and understood many things:

  • pc is the confidence and it corresponds to the IoU between predicted and ground truth bboxes
  • there are C classes
  • there are 4 coordinates times the number of bounding boxes (here only 1).

What I do not understand is this quote from the YOLO paper:

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

I construct my bbox coordinates as many websites and annotation tools suggest, which leads to: cx, cy, rh, rw, which are respectively the bbox center coordinates and the bbox height and width relative to the image (normalized by the image size). But this method doesn't seem to follow the paper's method, since cx and cy are not relative to a grid cell here.
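
For concreteness, this is roughly the conversion I mean (a sketch with made-up example values; xmin, ymin, xmax, ymax are the absolute corner coordinates from the annotation and img_w, img_h the image size):

img_w, img_h = 447, 447                        # image size (example values)
xmin, ymin, xmax, ymax = 180, 100, 260, 300    # absolute corner coordinates of one annotation

cx = (xmin + xmax) / 2 / img_w   # box center x, normalized to [0, 1]
cy = (ymin + ymax) / 2 / img_h   # box center y, normalized to [0, 1]
rw = (xmax - xmin) / img_w       # box width relative to the image
rh = (ymax - ymin) / img_h       # box height relative to the image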

I read the Hackernoon article but I don't understand how the author sets his x and y: he mentions x = (220-149)/149 = 0.48, where 149 is his grid cell size and 220 is the absolute x coordinate of the bbox. But if there are more cells and one splits the image into 6, for instance: 447/6 ≈ 75, then x = (220-75)/75 = 1.93, which is not a relative value (it is greater than 1)...

My questions:

  • Does that mean that I have to take into account the grid size (and so the grid cell size) when I create my dataset?
  • Do I need to include a pc in my training set? That is, cut the image into cells and return an n*n matrix with 0s and 1s as the pc values?

Thanks!


Solution

  • I post here a summary of what has been said and give more context about my initial questions.

    Bounding box coordinates

    First of all, as mentioned by @MichelKok: the initial YOLO model uses offsets relative to the grid cell size (by contrast, YOLOv5 does not employ this grid structure anymore).

    In YOLOv1, the bounding box regression is applied within a grid cell. Therefore the input coordinates have to be relative to one grid cell. In other words, if an image is "split" into 3x3 grid cells, the top-left cell is (0,0) and the bottom-right one is (2,2).

    One can implement this as follows:

    cell_size = 1 / S  # cx and cy are already normalized by the image size, so one cell spans 1/S in each direction
    cx_rcell = (cx % cell_size) / cell_size
    cy_rcell = (cy % cell_size) / cell_size
    

    Where:

    • cx and cy are respectively the x and y coordinates of the center of the bounding box, relative to the image (i.e. normalized by the width or height).
    • cell_size is the size of one grid cell in these image-relative coordinates, i.e. 1/S, where S is the number of grid cells in one direction.
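
    As a small sanity check, here is a sketch with made-up values (S = 3 and a box center at cx = 0.55):

    S = 3
    cell_size = 1 / S                        # one cell spans a third of the image width
    cx = 0.55                                # box center x, normalized by the image width
    col = int(cx / cell_size)                # = 1: the center falls in the second column of the grid
    cx_rcell = (cx % cell_size) / cell_size  # ≈ 0.65: offset of the center inside that cell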

    Finally, knowing the grid-cell indices of the box center (e.g. col = int(cx / cell_size) and row = int(cy / cell_size)), one can go back to the image-relative coordinates:

    cx = (col + cx_rcell) * cell_size
    cy = (row + cy_rcell) * cell_size
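
    Continuing the small example above, this indeed recovers the coordinate we started from:

    col, cx_rcell, cell_size = 1, 0.65, 1 / 3
    cx = (col + cx_rcell) * cell_size        # = 1.65 / 3 = 0.55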
    

    YOLOv1 Loss function

    There are 4 parts to this loss (the full expression from the paper is written out after this list):

    • minimizing the squared error of cx and cy coordinates
    • minimizing the squared error of the square roots of w and h
    • minimizing the squared error of the confidence numbers
    • minimizing the squared error of the class probabilities
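
    For reference, here is the full loss from the YOLOv1 paper (up to notation), where 1_obj(i,j) equals 1 when the j-th box predictor of cell i is responsible for an object and 0 otherwise, 1_noobj(i,j) is its complement, and the paper uses lambda_coord = 5 and lambda_noobj = 0.5:

    \begin{aligned}
    \mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
    &+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2 + (\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
    &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
      + \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
    &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
    \end{aligned}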

    Note that this loss is constructed in such a way that it penalizes errors differently depending on whether an object appears in the current grid cell. As I understood it, this means that one needs to store the training bounding boxes as tensors of shape (N, S, S, 5), so as to keep the information "in which grid cell does the center of this training bounding box fall?".

    A simple implementation could be:

    import torch
    bbox_tensor = torch.zeros(N, S, S, 5)
    row, col = int(cy * S), int(cx * S)  # indices of the grid cell containing the box center
    bbox_tensor[n, row, col] = torch.tensor([cx_rcell, cy_rcell, w, h, 1.0])  # last entry is the objectness pc
    

    Here, for image n of the batch, the bounding box coordinates (and the objectness pc) are stored in the cell (row, col) of the SxS grid that contains the box center.
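
    To put everything together, here is a minimal sketch (not a reference implementation; the names build_yolov1_targets, boxes and labels are hypothetical) that builds an (S, S, 5 + C) target tensor for a single image, with the pc/objectness channel set to 1 in every cell that contains a box center:

    import torch

    def build_yolov1_targets(boxes, labels, S=7, C=20):
        """boxes: list of (cx, cy, w, h) normalized by the image size; labels: class indices."""
        target = torch.zeros(S, S, 5 + C)
        for (cx, cy, w, h), cls in zip(boxes, labels):
            col, row = int(cx * S), int(cy * S)              # grid cell containing the box center
            cx_rcell, cy_rcell = cx * S - col, cy * S - row  # same cell-relative offsets as above, written differently
            target[row, col, 0:4] = torch.tensor([cx_rcell, cy_rcell, w, h])
            target[row, col, 4] = 1.0                        # pc: an object is present in this cell
            target[row, col, 5 + cls] = 1.0                  # one-hot class probabilities
        return target

    # Example: one box centered at (0.55, 0.30), of size 0.2 x 0.4, class 3, on a 3x3 grid with 5 classes
    t = build_yolov1_targets([(0.55, 0.30, 0.2, 0.4)], [3], S=3, C=5)
    print(t[0, 1])  # cell (row=0, col=1) holds approximately [0.65, 0.90, 0.20, 0.40, 1, 0, 0, 0, 1, 0]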