interface OpenCV's Mat containers with blas for matrix multiplication

I am processing UHD (2160 x 3840) images. One of the processing I do consist to process a Sobel filtering on X and Y axis then I have to multiply every output matrix by it's transpose and then I process the gradient image as the square root of the sum of the gradient.

So : S = sqrt( S_x * S_x^t + S_y * S_y^t).

Due to dimension of the image OpenCV take up to twenty seconds to process that without multithreading and ten with multithreading.

I know there OpenCV call OpenCL in order to speed up the filtering operations so I think it can take a long time in order to try to gain performance from the filtering step.

For the matrix multiplication I experience a kind of unstability from the OpenCV's OpenCL gemm kernel implementation.

So I would like to try to use OpenBLAS insted.

My questions are :

1.)

I wrote the following code but I face some issue for interface OpenCV's Mat objects :

template<class _Ty>
void mm(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    static_assert(true,"support matrix_multiply is only defined for floating precision numbers.");
}

template<>
inline void mm<float>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    const int M = A.rows;
    const int N = B.cols;
    const int K = A.cols;

    cblas_sgemm( CblasRowMajor ,// 1
                 CblasNoTrans, // 2 TRANSA
                 CblasNoTrans, // 3 TRANSB
                 M,       // 4 M
                 N,       // 5 N
                 K,       // 6 K
                 1.,           // 7 ALPHA
                 A.ptr<float>(),//8 A
                 A.rows,        //9 LDA
                 B.ptr<float>(),//10 B
                 B.rows,        //11 LDB
                 0.,            //12 BETA
                 C.ptr<float>(),//13 C
                 C.rows);       //14 LDC

}

template<>
inline void mm<double>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    cblas_dgemm(CblasRowMajor,CblasNoTrans,CblasNoTrans,A.rows,B.cols,A.cols,1.,A.ptr<double>(),A.rows,B.ptr<double>(),B.cols,0.,C.ptr<double>(),C.rows);
}
    void matrix_multiply(cv::InputArray _src1, cv::InputArray _src2, cv::OutputArray _dst)
    {

        CV_DbgAssert(  (_src1.isMat() || _src1.isUMat()) && (_src1.kind() == _src2.kind()) &&
                      (_src1.depth() == _src2.depth()) && (_src1.depth() == CV_32F) && (_src1.depth() == _src1.type()) &&
                      (_src1.rows() == _src2.cols())
                       );


        cv::Mat src1 = _src1.getMat();
        cv::Mat src2 = _src2.getMat();
        cv::Mat dst;

        bool cpy(false);

        if(_dst.rows() == _src1.rows() && _dst.cols() == _src2.cols() && _dst.type() == _src1.type())
            dst = _dst.getMat();
        else
        {
            dst = cv::Mat::zeros(src1.rows,src2.cols,src1.type());
            cpy = true;
        }

        if(cpy)
            dst.copyTo(_dst);
    }

I tried to organize the datas as specified here : http://www.netlib.org/lapack/explore-html/db/dc9/group__single__blas__level3.html#gafe51bacb54592ff5de056acabd83c260

without succes. This is my main issue

2.) I was thinking in order to try to speed up a little my implementation to apply the divide and conquer approach illustrated here :

https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm

But for only four submatrix. Does any one tried some similar approach or got a better way to gain performance in matrix multiplication (without using GPU) ?

Thank you in advance for any help.

Solution

I found a solution to the question 1). I based my first implementation on the documentation of the BLAS library. BLAS has been written in Fortran language, in this language the index start at 1 and not at 0 like in C or C++. Another thing is many libraries wrote in Fortran language organize their memory in column order (e.g. BLAS,LAPACK) rather than most of the C or C++ library (e.g. OpenCV) organize the memory in row order.

After taking these two properties in count I modified my code to :

template<class _Ty>
void mm(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    static_assert(true,"The function gemm is only defined for floating precision numbers.");
}

template<>
void mm<float>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    const int M = A.cols+1;
    const int N = B.rows;
    const int K = A.cols;

    cblas_sgemm( CblasRowMajor ,// 1
                 CblasNoTrans, // 2 TRANSA
                 CblasNoTrans, // 3 TRANSB
                 M,       // 4 M
                 N,       // 5 N
                 K,       // 6 K
                 1.,           // 7 ALPHA
                 A.ptr<float>(),//8 A
                 A.step1(),        //9 LDA
                 B.ptr<float>(),//10 B
                 B.step1(),        //11 LDB
                 0.,            //12 BETA
                 C.ptr<float>(),//13 C
                 C.step1());       //14 LDC
}

template<>
void mm<double>(cv::Mat& A,cv::Mat& B,cv::Mat& C)
{
    const int M = A.cols+1;
    const int N = B.rows;
    const int K = A.cols;

    cblas_dgemm( CblasRowMajor ,// 1
                 CblasNoTrans, // 2 TRANSA
                 CblasNoTrans, // 3 TRANSB
                 M,       // 4 M
                 N,       // 5 N
                 K,       // 6 K
                 1.,           // 7 ALPHA
                 A.ptr<double>(),//8 A
                 A.step1(),        //9 LDA
                 B.ptr<double>(),//10 B
                 B.step1(),        //11 LDB
                 0.,            //12 BETA
                 C.ptr<double>(),//13 C
                 C.step1());       //14 LDC
}

And every thing work well. Without additional multithreading or divide and conquer approach I was able to reduce the processing time of one step of my code from 150 ms to 500 us. So it fix every thing for me :).