Tags: c++, opencv, cuda, gpu, image-stabilization

Window size limit in GPU accelerated LK pyramid


I am performing image stabilization on a real-time feed so that I can run some vision algorithms on the stabilized images (emphasis on "real-time"). The current process uses the CPU implementation of pyramidal LK and is barely fast enough, even with the pyramid built beforehand (the reference image and "previous" features are only ever calculated once). It now needs to scale to images with about four times the resolution, which makes the current implementation too slow. I thought I might speed things up with the GPU, since OpenCV implements the same LK approach for CUDA-capable devices in the cv::gpu::PyrLKOpticalFlow class. I'm using the ::sparse call with a set of previous features.

My main issue is that there seems to be a limit on the window size, and mine is too large. The limit occurs in the pyrlk.cpp file as an assertion:

CV_Assert(patch.x > 0 && patch.x < 6 && patch.y > 0 && patch.y < 6);

The patch dimensions are computed just above that assertion:

void calcPatchSize(cv::Size winSize, dim3& block, dim3& patch)
{
    if (winSize.width > 32 && winSize.width > 2 * winSize.height)
    {
        block.x = deviceSupports(FEATURE_SET_COMPUTE_12) ? 32 : 16;
        block.y = 8;
    }
    else
    {
        block.x = 16;
        block.y = deviceSupports(FEATURE_SET_COMPUTE_12) ? 16 : 8;
    }

    patch.x = (winSize.width  + block.x - 1) / block.x;
    patch.y = (winSize.height + block.y - 1) / block.y;

    block.z = patch.z = 1;
}

My problem is that I need a window size of about 80x80 pixels, which is both why I want to employ GPU acceleration and why that seems not to work in OpenCV. :) In addition, with the larger-resolution images this window size will need to grow.
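To make the limit concrete, here is a minimal host-side sketch of the arithmetic in the quoted calcPatchSize and the assertion. It is not OpenCV code: Size2, Dim3, calcPatchSizeHost, patchAccepted and checkWindowLimits are hypothetical stand-ins, and the hasCompute12 flag replaces the deviceSupports(FEATURE_SET_COMPUTE_12) query. Taken at face value, the quoted code accepts square windows up to 80x80 when the device reports compute capability 1.2 or higher, and anything larger trips the assertion.

```cpp
#include <cassert>

// Hypothetical stand-ins for cv::Size and dim3 so the arithmetic runs on the host.
struct Size2 { int width; int height; };
struct Dim3 { unsigned x, y, z; };

// Mirrors the quoted calcPatchSize; hasCompute12 is an assumed replacement for
// deviceSupports(FEATURE_SET_COMPUTE_12).
inline void calcPatchSizeHost(Size2 winSize, bool hasCompute12, Dim3& block, Dim3& patch)
{
    if (winSize.width > 32 && winSize.width > 2 * winSize.height) {
        block.x = hasCompute12 ? 32u : 16u;
        block.y = 8u;
    } else {
        block.x = 16u;
        block.y = hasCompute12 ? 16u : 8u;
    }
    // Ceiling division: how many pixels each thread covers in x and y.
    patch.x = (static_cast<unsigned>(winSize.width) + block.x - 1) / block.x;
    patch.y = (static_cast<unsigned>(winSize.height) + block.y - 1) / block.y;
    block.z = patch.z = 1u;
}

// Same condition as the quoted CV_Assert, returned as a bool instead of aborting.
inline bool patchAccepted(Size2 winSize, bool hasCompute12)
{
    Dim3 block{}, patch{};
    calcPatchSizeHost(winSize, hasCompute12, block, patch);
    return patch.x > 0 && patch.x < 6 && patch.y > 0 && patch.y < 6;
}

// Boundary cases implied by the arithmetic above.
inline void checkWindowLimits()
{
    assert(patchAccepted({80, 80}, true));    // 16x16 block, 5x5 patch
    assert(!patchAccepted({81, 81}, true));   // 16x16 block, 6x6 patch
    assert(patchAccepted({160, 40}, true));   // wide window: 32x8 block, 5x5 patch
    assert(!patchAccepted({80, 80}, false));  // 16x8 block, patch.y becomes 10
}
```

Calling checkWindowLimits() from a small main() is enough to confirm these boundaries. In every branch the accepted window is at most five times the block size in each dimension.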

I'm not familiar with actually implementing GPU acceleration, so I am wondering if someone can explain why this limitation exists in OpenCV, whether it's a real limitation imposed by the hardware or by the OpenCV implementation, and whether there are ways to work around it. It seems odd that this would be a hardware limitation, since these are exactly the situations in which you'd want to use a GPU. I can get reasonable speed with smaller search windows, but the stabilization is not good enough for the application.

I need such a large search window size because I'm calculating the motion to the first (reference) frame. The motion is cyclical plus some small random drift so this method works well, but requires a bit more space to search at the peaks of the cycle when the matching features might be around 30-40 pixels away (at original resolution).

This is using OpenCV version 2.4.10 on Linux, built from source for CUDA support.

(This is a (somewhat modified) re-post from http://answers.opencv.org/question/54579/window-size-limit-in-gpu-accelerated-lk-pyramid/, but there doesn't seem to be much activity there so hopefully SO provides a better discussion environment!)


Solution

  • The patch size is passed to the CUDA kernel as a template parameter.

    See calling code at https://github.com/jet47/opencv/blob/master/modules/cudaoptflow/src/cuda/pyrlk.cu#L493:

    static const func_t funcs[5][5] =
    {
        {sparse_caller<1, 1, 1>, sparse_caller<1, 2, 1>, sparse_caller<1, 3, 1>, sparse_caller<1, 4, 1>, sparse_caller<1, 5, 1>},
        {sparse_caller<1, 1, 2>, sparse_caller<1, 2, 2>, sparse_caller<1, 3, 2>, sparse_caller<1, 4, 2>, sparse_caller<1, 5, 2>},
        {sparse_caller<1, 1, 3>, sparse_caller<1, 2, 3>, sparse_caller<1, 3, 3>, sparse_caller<1, 4, 3>, sparse_caller<1, 5, 3>},
        {sparse_caller<1, 1, 4>, sparse_caller<1, 2, 4>, sparse_caller<1, 3, 4>, sparse_caller<1, 4, 4>, sparse_caller<1, 5, 4>},
        {sparse_caller<1, 1, 5>, sparse_caller<1, 2, 5>, sparse_caller<1, 3, 5>, sparse_caller<1, 4, 5>, sparse_caller<1, 5, 5>}
    };
    

    where sparse_caller is declared as:

    template <int cn, int PATCH_X, int PATCH_Y>
    void sparse_caller(int rows, int cols, const float2* prevPts, float2* nextPts, 
                       uchar* status, float* err, int ptcount,
                       int level, dim3 block, cudaStream_t stream)
    

    The patch size is limited to keep the number of template instantiations down. You can raise the limit for your needs by modifying this code and adding more instantiations (the CV_Assert bounds in pyrlk.cpp have to be relaxed to match).
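As a rough illustration of what "adding more instantiations" means, the sketch below builds a larger dispatch table on the host. It is not the OpenCV source: sparse_caller_stub, kMaxPatch, makeRow, makeTable and lookup are hypothetical names, the stub only reports which instantiation was picked, and the limit of 8 is an arbitrary assumption. It also generates the table with std::index_sequence (C++14) instead of writing every entry by hand, which the toolchain used for the 2.4 branch may not accept. In that case the extra rows and columns have to be typed out as in the table above.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for sparse_caller: returns a code identifying the
// selected <PATCH_X, PATCH_Y> pair instead of launching a kernel.
template <int cn, int PATCH_X, int PATCH_Y>
int sparse_caller_stub() { return PATCH_X * 100 + PATCH_Y; }

using func_t = int (*)();

constexpr int kMaxPatch = 8;  // assumed new upper bound (the original table stops at 5)

// One table row: PATCH_Y fixed to Y + 1, PATCH_X running over 1..kMaxPatch.
template <std::size_t Y, std::size_t... X>
std::array<func_t, sizeof...(X)> makeRow(std::index_sequence<X...>)
{
    return {{ &sparse_caller_stub<1, static_cast<int>(X) + 1, static_cast<int>(Y) + 1>... }};
}

// Full table, indexed [patch.y - 1][patch.x - 1] like the original funcs array.
template <std::size_t... Y>
std::array<std::array<func_t, kMaxPatch>, sizeof...(Y)> makeTable(std::index_sequence<Y...>)
{
    return {{ makeRow<Y>(std::make_index_sequence<kMaxPatch>{})... }};
}

// Returns the instantiation for a patch size, or nullptr when it is out of
// range (the role the CV_Assert plays in the real code).
inline func_t lookup(int patchX, int patchY)
{
    static const auto funcs = makeTable(std::make_index_sequence<kMaxPatch>{});
    if (patchX < 1 || patchX > kMaxPatch || patchY < 1 || patchY > kMaxPatch)
        return nullptr;
    return funcs[static_cast<std::size_t>(patchY - 1)][static_cast<std::size_t>(patchX - 1)];
}
```

With this table, lookup(6, 6) resolves to a valid entry, whereas the original 5x5 table has no slot for it. Every extra entry is another kernel the compiler has to build, so compile time and binary size grow with the square of the limit. Larger patches may also raise per-thread resource use, so it is worth measuring before settling on a bound.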