Tags: c++, embedded, memset

memset slow on 32-bit embedded platform


I am developing on an embedded device (STM32, ARM Cortex-M4) and expected memset and similar functions to be optimized for speed. However, I noticed much slower behavior than expected. I'm using the GNU Arm Embedded compiler/linker (arm-none-eabi-gcc, etc.) with the -O3 optimization flag.

I looked into the disassembly and found that memset writes one byte at a time, rechecking the bound on every iteration.

0x802e2c4 <memset>:     add     r2, r0
0x802e2c6 <memset+2>:   mov     r3, r0
0x802e2c8 <memset+4>:   cmp     r3, r2
0x802e2ca <memset+6>:   bne.n   0x802e2ce <memset+10>
0x802e2cc <memset+8>:   bx      lr
0x802e2ce <memset+10>:  strb.w  r1, [r3], #1
0x802e2d2 <memset+14>:  b.n     0x802e2c8

Naturally, this code could be sped up with 32-bit writes and/or loop unrolling, at the expense of code size. It is possible the implementers chose not to optimize for speed in order to keep code size down.
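
As an illustration of that trade-off, a hand-written word-at-a-time fill might look roughly like the sketch below. This is not the newlib implementation, just an example of the idea: once the pointer is 4-byte aligned it stores 32 bits per iteration, at the cost of extra code.

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only -- not the library's memset. */
void *memset_word_sketch(void *dst, int c, size_t n)
{
    uint8_t *p = dst;
    uint8_t byte = (uint8_t)c;

    /* Fill byte-by-byte until the pointer is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        n--;
    }

    /* Broadcast the byte into a 32-bit word and store one word per iteration. */
    uint32_t word = byte * 0x01010101u;
    while (n >= 4) {
        *(uint32_t *)p = word;
        p += 4;
        n -= 4;
    }

    /* Trailing bytes. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}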

The memset header and library are being included from:

C:\Program Files (x86)\GNU Tools Arm Embedded\7 2018-q2-update\arm-none-eabi\include\string.h
C:\Program Files (x86)\GNU Tools Arm Embedded\7 2018-q2-update\arm-none-eabi\include\c++\7.3.1\cmath

This question is similar to existing questions about memset performance, but differs in that it targets an embedded platform.

Is there an optimized memset readily available within the GNU Arm Embedded package? If so, how can I access it?


Solution

  • Link without -specs=nano.specs. This links against the full C library (newlib instead of newlib-nano), whose memset is optimized for speed rather than size. It will also pull in larger versions of many other functions (the usual suspects being printf and malloc), which can in turn be tuned with additional linker options. Examining the disassembly and the linker map file will help.
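
As a rough illustration, the change is only in how the final link is invoked (the file names and CPU flags below are placeholders for whatever the project already uses):

# Size-optimized link: newlib-nano, byte-wise memset
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 -specs=nano.specs -o app.elf main.o

# Speed-optimized link: full newlib, plus a map file to see which functions grew
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -O3 -Wl,-Map=app.map -o app.elf main.o

Comparing the map file (or the disassembly of memset) before and after the change makes it easy to see which library functions were swapped for their larger, faster versions.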