Tags: opencv, deep-learning, object-detection, bounding-box, yolo

Where exactly does the bounding box start or end?


In OpenCV and in object detection models, a bounding box is represented by four numbers, e.g. x,y,width,height or x1,y1,x2,y2.

These numbers seem ill-defined. That hardly matters when the resolution is high, but when the image has a very low resolution, e.g. 8x8, a one-pixel error can make things go very wrong.

So I want to know, what exactly does it mean when you say that a bounding box has x1=0, x2=100?

Specifically, I want to clear up these points of confusion:

  • Does the bounding box border occupy the 0th pixel, or does it surround the 0th pixel (so its border is at x=-1)?
  • Where exactly does the bounding box end? If the image has shape=(8,8), is the end at 7 or 8?
  • If you want to represent a bounding box that occupies the entire image, what should its values be?

So I think the right question is: how should I think about bounding boxes intuitively so that none of this confuses me?


Solution

  • OK. After many days of working with bounding boxes, I now have my own intuition for how to think about bounding box coordinates.

    I divide coordinates into 2 categories: continuous and discrete. The confusion usually arises when you try to convert between them.

    Suppose the image has width=100 and height=100. Then a continuous point (x,y) can take any real value in the range [0,100] on each axis.

    It means that points like (0,0), (0.5,7.1), and (39.83,99.9999) are valid points.

    Now you can convert a continuous point to a discrete point on the image by taking the floor of each coordinate. E.g. (5.5, 8.9) gets mapped to pixel (5,8) on the image. It's very important to understand that you should not use the ceiling or rounding operation for this conversion. Suppose you have the continuous point (0.9,0.9): it lies inside pixel (0,0), so it belongs to pixel (0,0), not pixel (1,1).
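The floor rule above can be sketched in a few lines of Python. The helper name `to_pixel` is made up for illustration and is not part of OpenCV or any detection library.

```python
import math

def to_pixel(x: float, y: float) -> tuple[int, int]:
    """Return the index of the pixel that contains the continuous point (x, y)."""
    # floor keeps the point inside the pixel it actually lies in
    return math.floor(x), math.floor(y)

print(to_pixel(5.5, 8.9))   # (5, 8)
print(to_pixel(0.9, 0.9))   # (0, 0)

# Rounding or taking the ceiling would move 0.9 into the neighbouring pixel.
print(round(0.9), math.ceil(0.9))  # 1 1
```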

    From this foundation, let's try to answer my question:

    1. So I want to know, what exactly does it mean when you say that a bounding box has x1=0, x2=100?

      It means that continuous point 1 has x = 0 and continuous point 2 has x = 100. A continuous point has zero size. It's not a pixel.

    2. Does the bounding box border occupy the 0th pixel or is it surrounding 0th pixel (its border is at x=-1)?

      In continuous space, the bounding box border occupies zero space. The border is infinitesimally thin. But when we want to draw it onto an image, the border will be at least 1 pixel thick. So if we have the continuous point (0,0), it will be drawn in the 0th pixel of the image. But theoretically, it represents a thin border along the left and top sides of the 0th pixel.

    3. Where exactly does the bounding box end? If the image has shape=(8,8), is the end at 7 or 8?

      The largest x,y value that still lies inside a pixel is 7.999..., and when converted to the discrete version it becomes 7, which is the last pixel. The box edge itself can sit at 8 in continuous space, i.e. on the far side of pixel 7.
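One way to picture this for the 8x8 case: pixel i covers the half-open interval [i, i+1) along an axis. The sketch below assumes that convention; `pixel_index` and `pixel_extent` are hypothetical helper names, not library functions.

```python
import math

def pixel_index(x: float, size: int) -> int:
    """Index of the pixel containing continuous coordinate x on an axis of `size` pixels."""
    if not 0 <= x < size:
        raise ValueError(f"{x} lies outside every pixel of an axis of size {size}")
    return math.floor(x)

def pixel_extent(i: int) -> tuple[int, int]:
    """Continuous interval [start, end) covered by pixel i."""
    return i, i + 1

print(pixel_index(7.999, 8))  # 7
print(pixel_extent(7))        # (7, 8)
```

So 8 is a valid position for the right or bottom edge of a box (the end of pixel 7), but no pixel has index 8.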

    4. If you want to represent a bounding box that occupies the entire image, what should its values be?

      You should represent bounding box coordinates in continuous space rather than discrete space, because continuous coordinates keep more precision. It means the largest bounding box starts at (0,0) and ends at (100,100). But if you want to draw this box, you need to convert it to the discrete version and draw the bounding box from (0,0) to (99,99).
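A minimal sketch of that conversion, assuming the drawn corners should be the first and last pixels the continuous box touches. The helper `box_to_pixels` is hypothetical, and the `ceil(x2) - 1` rule for the far corner is my generalization of the (100,100) to (99,99) example, not something the original answer states for fractional edges.

```python
import math

def box_to_pixels(x1: float, y1: float, x2: float, y2: float) -> tuple[int, int, int, int]:
    """Convert a continuous box to inclusive pixel corners (first_x, first_y, last_x, last_y)."""
    # The near edge lies inside (or on the start of) pixel floor(x1).
    first_x, first_y = math.floor(x1), math.floor(y1)
    # The far edge at x2 touches pixels up to, but not including, ceil(x2).
    last_x, last_y = math.ceil(x2) - 1, math.ceil(y2) - 1
    return first_x, first_y, last_x, last_y

# Full 100x100 image: continuous (0,0)-(100,100) becomes pixels (0,0)-(99,99).
print(box_to_pixels(0, 0, 100, 100))  # (0, 0, 99, 99)

# The pixel count matches the continuous width: 99 - 0 + 1 == 100 - 0.
fx, fy, lx, ly = box_to_pixels(0, 0, 100, 100)
print(lx - fx + 1)  # 100
```

Note that a zero-size box (x1 == x2 on an integer) comes out with last < first under this rule, which can be read as "covers no pixels".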