I am developing on an embedded device (STM32, ARM Cortex-M4) and expected memset and similar functions to be optimized for speed. However, I noticed much slower behavior than expected. I'm using the GNU Arm Embedded toolchain (arm-none-eabi-gcc, etc.) with the -O3 optimization flag.
I looked into the disassembly, and the memset function writes one byte at a time and re-checks the bounds on every iteration:
0x802e2c4 <memset>: add r2, r0
0x802e2c6 <memset+2>: mov r3, r0
0x802e2c8 <memset+4>: cmp r3, r2
0x802e2ca <memset+6>: bne.n 0x802e2ce <memset+10>
0x802e2cc <memset+8>: bx lr
0x802e2ce <memset+10>: strb.w r1, [r3], #1
0x802e2d2 <memset+14>: b.n 0x802e2c8
Naturally, this code could be sped up by using 32-bit writes and/or loop unrolling at the expense of code size. It is possible the implementers chose not to optimize this for speed in order to keep code size down.
The memset header and library are being included from:
C:\Program Files (x86)\GNU Tools Arm Embedded\7 2018-q2-update\arm-none-eabi\include\string.h
C:\Program Files (x86)\GNU Tools Arm Embedded\7 2018-q2-update\arm-none-eabi\include\c++\7.3.1\cmath
This question is similar to existing questions but is different in that it targets an embedded platform.
Is there an optimized memset readily available within the GNU ARM embedded package? If so, how can I access it?
Link without -specs=nano.specs. This will use the version of the C library, including memset, that is optimized for speed instead of size. It will also pull in larger versions of many other functions (the usual suspects: printf and malloc), which could again be optimized by additional linker options. Examining the disassembly and the linker map file will help.