computer-vision pytorch object-detection faster-rcnn

Why does roi_align not seem to work in pytorch?


I am a PyTorch beginner. It seems that there is a bug in the RoIAlign module in torchvision. The code is simple, but the result is not what I expect.

code:

import torch
from torchvision.ops import RoIAlign

if __name__ == '__main__':
    output_size = (3, 3)
    spatial_scale = 1 / 4   # feature map is 1/4 the size of the input image
    sampling_ratio = 2      # 2x2 sample points per output bin

    # x has shape (1, 1, 6, 6): batch of 1, 1 channel, 6x6 feature map
    x = torch.FloatTensor([[
        [[1,  2,  3,  4,  5,  6 ],
         [7,  8,  9,  10, 11, 12],
         [13, 14, 15, 16, 17, 18],
         [19, 20, 21, 22, 23, 24],
         [25, 26, 27, 28, 29, 30],
         [31, 32, 33, 34, 35, 36]]
    ]])

    # each roi is (batch_index, x1, y1, x2, y2) in input-image coordinates
    rois = torch.tensor([
        [0, 0.0, 0.0, 20.0, 20.0],
    ])
    channel_num = x.shape[1]
    roi_num = rois.shape[0]

    a = RoIAlign(output_size, spatial_scale=spatial_scale, sampling_ratio=sampling_ratio)
    ya = a(x, rois)
    print(ya)

output:

tensor([[[[ 6.8333,  8.5000, 10.1667],
          [16.8333, 18.5000, 20.1667],
          [26.8333, 28.5000, 30.1667]]]])

But in this case, shouldn't it simply be an average pooling operation over every 2x2 cell, like:

tensor([[[[ 4.5000,  6.5000, 8.5000],
          [16.5000, 18.5000, 20.5000],
          [28.5000, 30.5000, 32.5000]]]])
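
For reference, that expected tensor is exactly what plain 2x2 average pooling gives. A quick sanity check (reusing x from the snippet above):

import torch.nn.functional as F

# average over non-overlapping 2x2 cells of the 6x6 feature map
print(F.avg_pool2d(x, kernel_size=2))
# tensor([[[[ 4.5000,  6.5000,  8.5000],
#           [16.5000, 18.5000, 20.5000],
#           [28.5000, 30.5000, 32.5000]]]])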

My environment is torch 1.3.0 with Python 3.6 and CUDA 10.1 on Ubuntu 16.04. This has puzzled me for two days, and I would greatly appreciate any help.


Solution

  • Intuitive Interpretation

    There are some subtleties with image coordinates. We need to take into account that pixels are squares, not points in space, and the convention is that integer coordinates refer to pixel centers: (0, 0) is the center of the first pixel, while (-0.5, -0.5) is its upper-left corner.

    This is why you aren't getting the result you expect. An roi that goes from (0, 0) to (5, 5) cuts through the border pixels, which leads to sampling between pixels when performing roi align. If instead we define the roi from (-0.5, -0.5) to (5.5, 5.5), we get the expected result. Accounting for the scale factor of 1/4, this translates to an roi from (-2, -2) to (22, 22) in input-image coordinates.

    import torch
    from torchvision.ops import RoIAlign
    
    output_size = (3, 3)
    spatial_scale = 1 / 4
    sampling_ratio = 2  
    
    x = torch.FloatTensor([[
        [[1,  2,  3,  4,  5,  6 ],
         [7,  8,  9,  10, 11, 12],
         [13, 14, 15, 16, 17, 18],
         [19, 20, 21, 22, 23, 24],
         [25, 26, 27, 28, 29, 30],
         [31, 32, 33, 34, 35, 36]]
    ]])
    
    rois = torch.tensor([
        [0, -2.0, -2.0, 22.0, 22.0],
    ])
    
    a = RoIAlign(output_size, spatial_scale=spatial_scale, sampling_ratio=sampling_ratio)
    ya = a(x, rois)
    print(ya)
    

    which results in

    tensor([[[[ 4.5000,  6.5000,  8.5000],
              [16.5000, 18.5000, 20.5000],
              [28.5000, 30.5000, 32.5000]]]])
    

  • Alternative Interpretation

    Partitioning the interval [0, 5] into 3 intervals of equal length gives [0, 1.67], [1.67, 3.33], [3.33, 5], so the boundaries of the output bins fall at these coordinates. With sampling_ratio=2, each bin is then sampled at a 2x2 grid of points (at offsets 0.42 and 1.25 within the first bin), which land between pixel centers, so each output value is a bilinear blend of neighboring pixels rather than a clean average over a 2x2 block. That is exactly why the first output is 6.8333 instead of 4.5.
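
    To see exactly where the 6.8333 comes from, here is a minimal sketch that redoes the sampling by hand (the bilinear helper is mine, not part of torchvision): each of the 3x3 bins is sampled at a 2x2 grid of points, each point is bilinearly interpolated between pixel centers, and the four samples are averaged.

    import torch

    feat = torch.arange(1.0, 37.0).reshape(6, 6)  # the same 6x6 feature map

    def bilinear(fm, y, x):
        # bilinear interpolation, treating integer coordinates as pixel centers
        y0, x0 = int(y), int(x)
        y1, x1 = min(y0 + 1, fm.shape[0] - 1), min(x0 + 1, fm.shape[1] - 1)
        ly, lx = y - y0, x - x0
        return ((1 - ly) * (1 - lx) * fm[y0, x0] + (1 - ly) * lx * fm[y0, x1]
                + ly * (1 - lx) * fm[y1, x0] + ly * lx * fm[y1, x1])

    bin_size = 5.0 / 3.0  # roi side (5 after scaling) split into 3 bins
    out = torch.zeros(3, 3)
    for ph in range(3):
        for pw in range(3):
            samples = []
            for iy in range(2):      # sampling_ratio = 2 -> 2x2 points per bin
                for ix in range(2):
                    y = ph * bin_size + (iy + 0.5) * bin_size / 2
                    x = pw * bin_size + (ix + 0.5) * bin_size / 2
                    samples.append(bilinear(feat, y, x))
            out[ph, pw] = sum(samples) / 4
    print(out)  # reproduces the 6.8333, 8.5000, ... tensor from the question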