python deep-learning computer-vision object-detection

What is S in COCO Object Keypoint Similarity equation?

I am trying to understand Object Keypoint Similarity (OKS), the metric used to evaluate keypoint detection. However, from the definition I cannot work out what "s" in the equation means. Here is the equation (https://cocodataset.org/#keypoints-eval):

$$\mathrm{OKS} = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 \kappa_i^2}\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

where "s" is described as the object scale. I don't know what that really means. Can someone explain it in more detail? Thank you.

Solution

What s represents

The s in the OKS equation represents the object scale, a measure of the size of the object that the keypoints belong to. In keypoint detection tasks such as human pose estimation (one of the most common applications), the "object" is the entity of interest in the image, in this case a human being.

The "scale" of an object refers to its size in the image. In the COCO evaluation, s is defined as the square root of the object's segment area, so the s² in the equation is simply that area (the "area" field of the annotation). Implementations often approximate it with the area of the bounding box when no segmentation is available. For human pose estimation, a large area means the person appears large in the image and therefore has a large scale. Conversely, a small area means the person appears small and the scale is small.

The scale matters when judging distances between keypoints (elbows, knees, eyes, etc. on a human body). At a larger scale the keypoints are generally farther apart, because the person takes up more space in the image. At a smaller scale the keypoints are closer together, because the person appears smaller.

In the OKS equation, s normalizes the distance d_i between each predicted keypoint and its ground-truth location, so the error is measured relative to the object's size rather than in raw pixels.
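To make the role of s concrete, here is a minimal sketch of the OKS formula in NumPy. It is not the reference COCO implementation. The function name oks, its argument layout, and the kappa values in the usage lines are assumptions chosen for illustration, and the object area is passed in directly as s².

```python
import numpy as np

def oks(pred_xy, gt_xy, visibility, area, kappa):
    """Sketch of Object Keypoint Similarity for one object.

    pred_xy, gt_xy : (K, 2) predicted and ground-truth keypoint coordinates
    visibility     : (K,) ground-truth visibility flags v_i (0 = not labeled)
    area           : object area in pixels^2, used as s^2 (assumption: s = sqrt(area))
    kappa          : (K,) per-keypoint falloff constants kappa_i
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    gt_xy = np.asarray(gt_xy, dtype=float)
    visibility = np.asarray(visibility)
    kappa = np.asarray(kappa, dtype=float)

    # Squared Euclidean distance d_i^2 for every keypoint
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)

    # Only labeled keypoints (v_i > 0) contribute to the sums
    labeled = visibility > 0
    if not labeled.any():
        return 0.0

    # exp(-d_i^2 / (2 s^2 kappa_i^2)), with s^2 = area
    similarity = np.exp(-d2 / (2.0 * float(area) * kappa ** 2))
    return float(similarity[labeled].mean())

# Illustrative values only: two keypoints, each predicted 5 pixels off
gt = [[10, 10], [20, 20]]
pred = [[13, 14], [23, 24]]
kappa = [0.1, 0.1]  # made-up constants, not the COCO ones
print(oks(pred, gt, [2, 2], area=100 * 100, kappa=kappa))  # larger object
print(oks(pred, gt, [2, 2], area=50 * 50, kappa=kappa))    # smaller object
```

With the same pixel error, the call with the larger area returns the higher score, which is exactly the normalization that s provides.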

Why divide by it

By dividing by the scale, we account for the fact that an error of, say, five pixels is far more significant when the person is small in the image (and the keypoints are close together) than when the person is large in the image (and the keypoints are far apart).
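As a quick numeric illustration of this point, the sketch below evaluates the single-keypoint term exp(-d² / (2 s² κ²)) for the same five-pixel error at two different scales. The helper name keypoint_similarity, the value of kappa, and the two scales are hypothetical numbers picked only to show the effect.

```python
import math

def keypoint_similarity(d, s, kappa):
    """Per-keypoint OKS term: exp(-d^2 / (2 * s^2 * kappa^2))."""
    return math.exp(-d ** 2 / (2 * s ** 2 * kappa ** 2))

kappa = 0.1  # illustrative falloff constant, not a COCO value
d = 5.0      # the same 5-pixel error in both cases

small = keypoint_similarity(d, s=50.0, kappa=kappa)   # small person, s = 50
large = keypoint_similarity(d, s=200.0, kappa=kappa)  # large person, s = 200

print(f"small person: {small:.2f}")  # about 0.61
print(f"large person: {large:.2f}")  # about 0.97
```

The five-pixel miss costs the small person much more similarity than the large one, even though the raw pixel distance is identical.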