I am working with a simple code that uses this function in an academic project:
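void calculateDistanceMatrix(const float data[M][N],
                             float distance[M][M]) {
    for (int i = 0; i < M; i++) {
        /* Set the diagonal here so the last row (i == M-1) is covered too. */
        distance[i][i] = 0.0f;
        for (int j = i + 1; j < M; j++) {
            float sum = 0.0f;
            /* Squared Euclidean distance between rows i and j of 'data'. */
            for (int k = 0; k < N; k++) {
                sum += (data[i][k] - data[j][k]) *
                       (data[i][k] - data[j][k]);
            }
            /* The matrix is symmetric, so fill both halves. */
            distance[i][j] = sum;
            distance[j][i] = sum;
        }
    }
}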
The function does nothing more than compute the squared Euclidean distance between every pair of rows of 'data' and fill the 'distance' matrix with the results. In my academic project, however, I am interested in how the compiler optimizes these vector operations for the ARM architecture I am working with. The command line for the compilation contains the following:
arm-none-eabi-gcc <flags> <my_sources> -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=hard <more_flags>
My program is intended to run on an embedded Xilinx Zynq-7000 device, whose architecture includes the NEON optimized instruction set for vector operations (described in this nice presentation).
I have to track the performance of the execution of the vector operations in the 'calculateDistanceMatrix' function with and without compiler optimizations. I notice the assembly output includes the shared NEON and VFP instructions for the vector load and store operations (detailed in ARM's Assembler Reference for Version 5.0):
ecf37a01 vldmia r3!, {s15}
ecf26a01 vldmia r2!, {s13}
e1530000 cmp r3, r0
ee777ae6 vsub.f32 s15, s15, s13
ee077aa7 vmla.f32 s14, s15, s15
1afffff9 bne 68 <calculateDistanceMatrix+0x48>
eca17a01 vstmia r1!, {s14}
I couldn't find a way to compile this code such that these optimized instructions are not used.
Do you know of any compiler configuration or code trick that would avoid these instructions? I would appreciate any help on this issue.
I revisited this issue and found out that my environment was set to build in debug mode, so no optimization was actually taking place.
The actual optimized code uses the VLDM and VSTM instructions. They are not generated, however, when I add the pragma
#pragma GCC optimize ("O0")
in my source file.
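For anyone who wants both variants in the same build, here is a minimal sketch of how the pragma could be scoped to a single function. GCC's `#pragma GCC optimize` applies to every function defined after it in the file, so wrapping it in `#pragma GCC push_options` / `#pragma GCC pop_options` limits the effect. The function names `calculateDistanceMatrixO0` and `calculateDistanceMatrixDefault` and the values of `M` and `N` are made up for illustration; they are not from my project, and I have not checked that the generated code matches the listing above.

```c
/* Hypothetical sizes, chosen only to keep the example small. */
#define M 3
#define N 2

/* Save the command-line optimization options, then force -O0
 * for the function(s) defined until the matching pop_options. */
#pragma GCC push_options
#pragma GCC optimize ("O0")
void calculateDistanceMatrixO0(const float data[M][N],
                               float distance[M][M]) {
    for (int i = 0; i < M; i++) {
        distance[i][i] = 0.0f;
        for (int j = i + 1; j < M; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += (data[i][k] - data[j][k]) *
                       (data[i][k] - data[j][k]);
            }
            distance[i][j] = sum;
            distance[j][i] = sum;
        }
    }
}
#pragma GCC pop_options

/* Same body, compiled with whatever -O level is on the command line. */
void calculateDistanceMatrixDefault(const float data[M][N],
                                    float distance[M][M]) {
    for (int i = 0; i < M; i++) {
        distance[i][i] = 0.0f;
        for (int j = i + 1; j < M; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++) {
                sum += (data[i][k] - data[j][k]) *
                       (data[i][k] - data[j][k]);
            }
            distance[i][j] = sum;
            distance[j][i] = sum;
        }
    }
}
```

Both functions should fill 'distance' with the same values; only the generated instructions should differ, which is what makes it possible to time one against the other in a single binary. These pragmas are GCC-specific, so other compilers may ignore them.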