I am using YOLOv4 and DeepSORT to detect and track people in a frame.
My goal is to get the speed of a moving person in meaningful units without calibration, because I would like to be able to move the camera with this model to different rooms without having to calibrate it each time. I understand this is a very difficult problem. I am currently getting speed in pixels per second, but that is inaccurate because objects closer to the camera appear to "move" faster in pixels.
My question: can I use the bounding box from the person detection as a measurement of a person's size in pixels, assume an average human size (say 68 inches tall by 15 inches wide), and thereby have the "calibration" metrics needed to determine, in inches/s, how far the person moved from Point A to Point B in the frame, based on the person's apparent size in Region A versus Region B?
In short, is there a way to use an object's apparent size to recover how fast it is really moving in the frame?
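For example (just to illustrate the arithmetic): if I assume a person is 68 inches tall and their bounding box is 136 pixels tall, the local scale is 2 pixels per inch, so a 100 pixel displacement in that region would correspond to roughly 50 inches.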
Any suggestions would be helpful! Thanks!
This is how I am calculating speed now:
import math

# Calculate the center of the bounding box (bbox = [x1, y1, x2, y2])
xCenter = int((bbox[0] + bbox[2]) / 2)
yCenter = int((bbox[1] + bbox[3]) / 2)

# metrics maps track_id -> [[frame_idx, xCenter, yCenter], [frame_idx, xCenter, yCenter], ...]
values = metrics[track_id]

# Calculate displacement, velocity and speed between the last two observations
if len(values) > 1:
    delta_frames = values[-1][0] - values[-2][0]
    delta_t = delta_frames / fps  # fps = 30
    delta_x = values[-1][1] - values[-2][1]
    delta_y = values[-1][2] - values[-2][2]
    total_displacement = math.sqrt(delta_x ** 2 + delta_y ** 2)
    speed = total_displacement / delta_t  # pixels per second
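To illustrate the problem: at 30 fps, a center that moves 50 pixels between consecutive frames registers 1500 pixels/second, yet the same physical step produces far fewer pixels when the person is farther from the camera, so the number is not comparable across the frame.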
I think this is the answer I have been looking for.
I calculate the height and width of the bounding box and get the pixels per inch at that location by dividing the box dimensions by the average human height and width. Then, to convert a pixel displacement to inches, I take the linspace() between the inches-per-pixel (the reciprocal of pixels per inch, so the units work out) in Region A and in Region B, one sample per pixel moved, and sum it to get the distance. It's not very accurate though, so maybe I can improve on that somehow.
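For example: if the scale goes from 2 px/inch in Region A to 4 px/inch in Region B over a 120 pixel move, the interpolated inches-per-pixel averages (1/2 + 1/4) / 2 = 0.375, so the move is about 120 × 0.375 = 45 inches.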
Mainly the inaccuracies come from the bounding box. Top to bottom the box is pretty good, but left to right (width) it is not, since it takes in the arms and legs. I am going to see if I can use just a human head detection as the measurement instead (see the sketch after the code below).
import math
import numpy as np

# Average human dimensions in inches
avg_person_width = 15
avg_person_height = 65

# Width and height of the bounding box in pixels (bbox = [x1, y1, x2, y2])
bbox_width = bbox[2] - bbox[0]
bbox_height = bbox[3] - bbox[1]

# Local scale at this detection: pixels per inch
pixels_per_inch_width = bbox_width / avg_person_width
pixels_per_inch_height = bbox_height / avg_person_height

if track_id in metrics:
    # Append the new observation to the existing list for this track
    metrics[track_id].append([frame_idx, xCenter, yCenter, pixels_per_inch_width, pixels_per_inch_height])
else:
    # Create a new list for this track
    metrics[track_id] = [[frame_idx, xCenter, yCenter, pixels_per_inch_width, pixels_per_inch_height]]

values = metrics[track_id]

# Calculate displacement, velocity and speed
if len(values) > 1:
    # Skip entries containing any zero value (a zero scale would divide by zero below)
    if all(values[-1]) and all(values[-2]):
        delta_frames = values[-1][0] - values[-2][0]
        delta_t = delta_frames / fps
        delta_x = values[-1][1] - values[-2][1]
        delta_y = values[-1][2] - values[-2][2]
        ppi_width_a, ppi_height_a = values[-2][3], values[-2][4]  # scale at Region A
        ppi_width_b, ppi_height_b = values[-1][3], values[-1][4]  # scale at Region B
        # Interpolate the inches-per-pixel scale (the reciprocal of pixels per
        # inch) along the move, one sample per pixel, and sum to get inches
        inches_x = np.linspace(1 / ppi_width_a, 1 / ppi_width_b, abs(delta_x)).sum()
        inches_y = np.linspace(1 / ppi_height_a, 1 / ppi_height_b, abs(delta_y)).sum()
        total_displacement = math.sqrt(inches_x ** 2 + inches_y ** 2)
        # Inches / second (IPS)
        speed_ips = total_displacement / delta_t
        # Conversion: 1 inch per second (in/s) = 0.056818182 miles per hour (mph)
        # Miles / hour; an average human walks at < 3 mph
        speed_mph = speed_ips * 0.056818182
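For the head-based idea, here is a minimal sketch under stated assumptions: the head detector itself is hypothetical (any model that returns a head box would do), and the 9 inch average head height is my own assumption. It just swaps the head box in as the scale reference:

import numpy as np

AVG_HEAD_HEIGHT_IN = 9.0  # assumed average head height, chin to crown

def pixels_per_inch_from_head(head_bbox):
    """Local scale (pixels per inch) from a head box (x1, y1, x2, y2)."""
    head_height_px = head_bbox[3] - head_bbox[1]
    return head_height_px / AVG_HEAD_HEIGHT_IN

def displacement_inches(point_a, point_b, ppi_a, ppi_b):
    """Convert a pixel displacement between two tracked points to inches,
    averaging the inches-per-pixel scale at the two endpoints."""
    delta_px = np.hypot(point_b[0] - point_a[0], point_b[1] - point_a[1])
    inches_per_px = 0.5 * (1.0 / ppi_a + 1.0 / ppi_b)
    return delta_px * inches_per_px

# Example: an 18 px tall head at A (2 px/inch) and a 27 px tall head at B (3 px/inch)
ppi_a = pixels_per_inch_from_head((100, 50, 116, 68))
ppi_b = pixels_per_inch_from_head((300, 40, 322, 67))
print(displacement_inches((108, 59), (311, 53), ppi_a, ppi_b))  # ~84.6 inches

The appeal is that a head's apparent height does not change with arm or leg pose, so the scale estimate should be steadier than the full-body width.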