SSE _mm_dp_ps size result

I am starting to do operations using SSE. I want to make two dot products with _mm_dp_ps and save the first result in aux_sse and the second in aux_sse. B is an 8-element vector of value 1.

Since I only need two floats for each pair, I have done the following code:

    printf("A  \n");
    for(i = 0; i < M; i++){
        for(j = 0; j < ele; j++){
            A[i*ele+j] = i*ele+j;
            printf(" %f ", A[i*ele+j]);
        }
        printf("\n");
    }
    
    float aux[ele*M];
    float aux2[ele*M];
    __m128 *A_sse = (__m128*) A;
    __m128 *B_sse = (__m128*) B;
    __m128 *aux_sse  = (__m128*) aux;
    __m128 *aux2_sse  = (__m128*) aux2;
    for(int i = 0; i < M; i++)
    {
        *aux_sse =  _mm_dp_ps (*A_sse, *B_sse,  0xFF);
        printf("%f \n", aux[i]);

        B_sse ++;
        A_sse++;
        *aux2_sse = _mm_dp_ps (*A_sse, *B_sse,  0xFF);
        printf("%f \n", aux2[i]);

        B_sse --;
        A_sse ++;

        aux_sse+= sizeof(char);
        aux2_sse+= sizeof(char);
    }

I get the following wrong output:

A  
 0.000000  1.000000  2.000000  3.000000  4.000000  5.000000  6.000000  7.000000 
 8.000000  9.000000  10.000000  11.000000  12.000000  13.000000  14.000000  15.000000 
6.000000 
22.000000 
6.000000 
22.000000

According this:

Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum in dst using the low 4 bits of imm8.

I understand that in imm8 we specify in the elements that we want the result to be saved.

As I understand, even if the result is in the 4 elements of the output vector, if I only make an increment of one element with aux_sse+= sizeof(char), the result should be overwritten and the desired result will come out. However, I see that this is not the case.

If I make the following modifications when I'm printing the result of aux and aux2 the output is correct.

printf("%f \n", aux[i*4]);
printf("%f \n", aux2[i*4]);

Output:

I am using gcc compiler. Does anyone know what the problem is? Any answer would be helpful.

EDIT:

I need the elements of aux and aux2 to correspond to each iteration:

aux[i] = dot_product performed in iteration i

Solution

aux_sse+= sizeof(char); is a nonsensical way to write aux_sse+=1, i.e. advance by 16 bytes, aka 4 floats, i.e. sizeof(__m128) == sizeof(*aux_sse) == 16.

So if you're also accessing the array via float indices, yes, you have to scale it by 4 if you only increment i by 1 for every vector of 4 floats.

Normally it's easier to use _mm_store_ps(&aux[i], v); instead of keeping track of __m128* variables to access the same arrays. And do i+=4 so i is actually indexing the start of the 4-element group you have, instead of needing to scale it. That makes it easier to write loop bounds like i < M-3.

Also note that you should use alignas(16) float aux[ele*M]; if you want to do alignment-required accesses to the arrays. GCC notices what you're doing and will align the arrays for you when it can see how they're used, but don't count on it in general.

Or did you want to just store a single float result, instead of 4 identical dot-products for every group of 4 inputs? In that case you should be extracting the low scalar element, e.g. _mm_store_ss (&aux[i], v). Or _mm_cvtss_f32(v) to get the low element of a vector as a scalar float.

If you want that, you can do the four 4-element dot products manually, producing 1 vector of 4 results. _mm_mul_ps and then maybe 2x _mm_hadd_ps (SSE3) to horizontally reduce, a sort of transpose and add. (Suggested by @mainactual)

dpps is 4 uops on Skylake and similar Intel CPUs (https://uops.info), so it's not great if you have multiple dot products to do.

To avoid SSE3, you'd use _mm_shuffle_ps (shufps) to pick elements from 2 vectors, or maybe some _mm_unpacklo_ps / unpackhi could be useful, or maybe the pd version to keep pairs of elements together.