Search code examples
c++algorithmssesimdavx

Vectorization of modulo multiplication


I have a function:

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    for (size_t i = 0; i < size; ++i)
        c[i] = (a[i]*b[i])%p;
}

This function performs many modulo multiplication for arrays of integer. All integers are positive. And I need to improve its performance.

I thought about SSE and AVX. But they don't have an operation to vectorize modulo multiplication. Or maybe I'm wrong?

Maybe anybody know any posibility to solve this problem?


Solution

  • At first I want to note that modulo operation can be realized with using of float point numbers:

    d % p = d - int(float(d)/float(p))*p.
    

    Although the amount of operation in right part is greater then in left one this part is preferable because it can be vectorized with using of SSE/AVX.

    An implementation with SSE4.1 for 32x32 => 32-bit integer multiplication. Note that conversion from FP back to integer is done with round-to-nearest; use truncation toward zero (cvttps_epi32) if you want semantics like C float->integer conversions.

    void Func(const int * a, const int * b, size_t size, int p, int * c)
    {
        __m128 _k = _mm_set1_ps(1.0f / p);
        __m128i _p = _mm_set1_epi32(p);
        for (size_t i = 0; i < size; i += 4)
        {
            __m128i _a = _mm_loadu_si128((__m128i*)(a + i));
            __m128i _b = _mm_loadu_si128((__m128i*)(b + i));
            __m128i _d = _mm_mullo_epi32(_a, _b);
            __m128i _e = _mm_cvtps_epi32(_mm_mul_ps(_mm_cvtepi32_ps(_d), _k)); // e = int(float(d)/float(p));
            __m128i _c = _mm_sub_epi32(_d, _mm_mullo_epi32(_e, _p));
            _mm_storeu_si128((__m128i*)(c + i), _c);
        }            
    }
    

    An implementation with using of AVX :

    void Func(const int * a, const int * b, size_t size, int p, int * c)
    {
        __m256 _k = _mm256_set1_ps(1.0f / p);
        __m256i _p = _mm256_set1_epi32(p);
        for (size_t i = 0; i < size; i += 8)
        {
            __m256i _a = _mm256_loadu_si128((__m256i*)(a + i));
            __m256i _b = _mm256_loadu_si128((__m256i*)(b + i));
            __m256i _d = _mm256_mullo_epi32(_a, _b);
            __m256i _e = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_cvtepi32_ps(_d), _k)); // e = int(float(d)/float(p));
            __m256i _c = _mm256_sub_epi32(_d, _mm256_mullo_epi32(_e, _p));
            _mm256_storeu_si128((__m256i*)(c + i), _c);
        }            
    }