I'm trying to construct a YOLO dataset. I read a lot about the format and understood many things.
What I do not understand is this quote drawn from the YOLO paper:
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
I construct my bbox coordinates the way many websites and annotation tools suggest, which leads to: cx, cy, rh, rw, which are respectively the bbox center coordinates and the bbox height and width relative to the image (normalized by the image size). But this method doesn't seem to follow the paper's method, since cx and cy are not relative to a grid cell here.
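For concreteness, here is a minimal sketch of how I compute them, assuming the annotations give pixel corner coordinates (xmin, ymin, xmax, ymax) and the image size (img_w, img_h):

cx = (xmin + xmax) / 2 / img_w   # box center x, relative to the image
cy = (ymin + ymax) / 2 / img_h   # box center y, relative to the image
rw = (xmax - xmin) / img_w       # box width, relative to the image
rh = (ymax - ymin) / img_h       # box height, relative to the image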
I read the Hackernoon article, but I don't understand how he sets his x and y. He mentions x = (220 - 149) / 149 = 0.48, where 149 is his grid cell size and 220 is the absolute x coordinate of the bbox center. But if there are more cells, and one splits the image into 6 for instance (447 / 6 ≈ 75), then x = (220 - 75) / 75 = 1.93, which is not a relative value (it is greater than 1)...
Thanks!
Here is a summary of what has been said, along with more context about my initial questions.
First of all, as mentioned by @MichelKok: the initial YOLO model uses offsets relative to the grid cell size (by contrast, YOLOv5 does not employ this grid structure anymore).
In YOLOv1, the bounding box regression is applied within a grid cell. Therefore, the input coordinates have to be relative to one grid cell. In other words, if an image is "split" into 3x3 grid cells, the top-left cell is (0,0) and the bottom-right is (2,2).
One can implement this as follows:

S = 7                          # number of grid cells in one direction (7 in the paper)
cell_size = 1 / S              # cell size in image-relative units (use width/S or height/S for pixel coordinates)
cx_rcell = (cx % cell_size) / cell_size
cy_rcell = (cy % cell_size) / cell_size
where cx and cy are respectively the x and y coordinates of the center of the bounding box, relative to the image (i.e. normalized by the width or height), and cell_size is the size of one grid cell, defined by the number of grid cells in one direction, S.
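This also answers my initial question about the Hackernoon example: the offset must be taken from the left edge of the cell that actually contains the center, not from a single cell width. With the numbers from the question (image width 447, S = 6, center at x = 220 px):

cx = 220 / 447                            # image-relative center, ≈ 0.492
cell_size = 1 / 6                         # ≈ 0.167, i.e. ≈ 74.5 px on a 447 px image
j = int(cx / cell_size)                   # the center falls in cell j = 2
cx_rcell = (cx % cell_size) / cell_size   # ≈ 0.95, indeed between 0 and 1

My mistake was computing (220 - 75) / 75, i.e. subtracting one cell width instead of the two cells that lie to the left of the center.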
Finally, one can go back to the image-relative coordinates, given the indices (i, j) of the grid cell containing the center:

cx = (j + cx_rcell) * cell_size
cy = (i + cy_rcell) * cell_size
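Continuing the example above, this recovers the original coordinate:

cx = (2 + 0.95) * (1 / 6)   # ≈ 0.492 = 220/447, the original image-relative cx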
There are 4 parts to this loss: the bounding box center loss, the bounding box width/height loss (applied to square roots, so errors on large boxes weigh less), the confidence loss (split between cells that contain an object and cells that don't), and the classification loss.
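For reference, the full loss from the YOLOv1 paper (with λ_coord = 5, λ_noobj = 0.5, and 1^obj_ij indicating that the j-th box predictor in cell i is responsible for an object):

$$
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+\, &\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
+\, &\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
+\, &\sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$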
Note that this loss is constructed in such a way that it penalizes errors differently depending on whether an object appears in the current grid cell. As I understand it, this means one needs to build the bounding box training targets as tensors of shape (N, S, S, 5) in order to encode the information "in which grid cell is the center of this training bounding box?".
A simple implementation could be:

bbox_tensor = torch.zeros(N, S, S, 5)
i, j = int(cy * S), int(cx * S)           # indices of the grid cell containing the box center
bbox_tensor[n, i, j] = torch.tensor([cx_rcell, cy_rcell, w, h, 1.0])

where n is the index of the image in the batch.
Here, the bounding box coordinates (together with a confidence of 1) are stored in cell (i, j) of the S x S grid.
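Putting everything together, here is a minimal sketch of a target-building routine for one image (the name build_target and the input box format are my own choices, not from the paper):

import torch

def build_target(boxes, S=7):
    # boxes: list of (cx, cy, w, h) tuples, all relative to the image (values in [0, 1])
    # returns an (S, S, 5) target tensor holding (cx_rcell, cy_rcell, w, h, confidence)
    target = torch.zeros(S, S, 5)
    for cx, cy, w, h in boxes:
        i = min(int(cy * S), S - 1)   # row of the cell containing the center
        j = min(int(cx * S), S - 1)   # column (clamped in case cx or cy equals 1.0)
        cx_rcell = cx * S - j         # offset within the cell, in [0, 1)
        cy_rcell = cy * S - i
        target[i, j] = torch.tensor([cx_rcell, cy_rcell, w, h, 1.0])
    return target

# example with the x coordinate from the question, on a 6x6 grid
# (the cy, w, h values are arbitrary, just for illustration)
target = build_target([(220 / 447, 0.4, 0.3, 0.2)], S=6)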