How can I prevent vector-operation optimization of this function with the arm-none-eabi-gcc compiler?

My code

I am working with a simple piece of code that uses this function in an academic project:

void calculateDistanceMatrix(const float data[M][N],
                             float distance[M][M]) {
    float sum = 0.0;
    for(int i = 0; i < M; i++) {
        for(int j = i+1; j < M; j++) {
            for(int k = 0; k < N; k++) {
                sum += (data[i][k] - data[j][k]) *
                       (data[i][k] - data[j][k]);
            }
            distance[i][j] = sum;
            distance[j][i] = sum;
            distance[i][i] = 0.0;
            sum = 0.0;
        }
    }
}

My target architecture

My code does nothing more than perform this simple matrix operation on 'data' and fill the 'distance' matrix with the results. In my academic project, however, I am interested in how the compiler optimizes these vector operations for the ARM architecture I am working with. The compilation command line contains the following:

arm-none-eabi-gcc <flags> <my_sources> -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=hard <more_flags>

My program is intended to run on an embedded Xilinx Zynq-7000 device, whose architecture includes the NEON instruction set for optimized vector operations (described in this nice presentation).

My issue

I have to track the performance of the vector operations in the 'calculateDistanceMatrix' function with and without compiler optimizations. I noticed that the assembly output includes the shared NEON and VFP instructions for the vector load and store operations (detailed in ARM's Assembler Reference for Version 5.0):

ecf37a01    vldmia  r3!, {s15}
ecf26a01    vldmia  r2!, {s13}
e1530000    cmp r3, r0
ee777ae6    vsub.f32    s15, s15, s13
ee077aa7    vmla.f32    s14, s15, s15
1afffff9    bne 68 <calculateDistanceMatrix+0x48>
eca17a01    vstmia  r1!, {s14}

I couldn't find a way to compile this code such that these optimized instructions are not used.

Do you know of any compiler configuration or code trick that could avoid these instructions? I would appreciate any help with this issue.

Solution

I revisited this issue and found out that my environment was set to build in debug mode, so no optimization was actually taking place.

The actual optimized code uses the VLDM and VSTM instructions. They are not generated, however, when I add the pragma

#pragma GCC optimize ("O0")

in my source file.
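For anyone who needs both variants in the same build, here is a minimal sketch of how the pragma can be limited to one function with push_options/pop_options, so the rest of the file still follows the command-line -O level. The function names, the values of M and N, and the M x M shape of the result are assumptions made for illustration; they are not taken from my project.

```c
/* Hypothetical sizes; the real values of M and N are not given above. */
#define M 3
#define N 2

/* Variant 1: everything between push_options and pop_options is built
   at -O0, whatever -O level is passed on the command line. */
#pragma GCC push_options
#pragma GCC optimize ("O0")
void distanceMatrixUnoptimized(const float data[M][N],
                               float distance[M][M]) {
    for (int i = 0; i < M; i++) {
        distance[i][i] = 0.0f;               /* distance to itself */
        for (int j = i + 1; j < M; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                float d = data[i][k] - data[j][k];
                sum += d * d;                /* squared Euclidean distance */
            }
            distance[i][j] = sum;            /* the matrix is symmetric */
            distance[j][i] = sum;
        }
    }
}
#pragma GCC pop_options

/* Variant 2: same body, but compiled with whatever optimization level
   the command line requests (for example -O2). */
void distanceMatrixDefault(const float data[M][N],
                           float distance[M][M]) {
    for (int i = 0; i < M; i++) {
        distance[i][i] = 0.0f;
        for (int j = i + 1; j < M; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                float d = data[i][k] - data[j][k];
                sum += d * d;
            }
            distance[i][j] = sum;
            distance[j][i] = sum;
        }
    }
}
```

Disassembling the two functions side by side should make the effect of the optimizer visible without switching the whole build between debug and release. As far as I know, the GCC documentation describes the optimize pragma and attribute as intended for debugging rather than production code, so this is probably best kept to measurement builds.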