Tags: c, optimization, assembly, x86, 64-bit

Why are complicated memcpy/memset superior?


When debugging, I frequently stepped into the handwritten assembly implementations of memcpy and memset. These are usually implemented with streaming instructions (when available), loop unrolling, alignment handling, and so on. I also recently encountered this 'bug' caused by a memcpy optimization in glibc.
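For context, here is a rough, illustrative sketch of the techniques mentioned above (alignment of the destination, a 4x unrolled inner loop, and non-temporal streaming stores). It uses SSE2 intrinsics; the name streaming_copy is made up for this example, and this is not glibc's actual code:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Illustrative only. Real implementations also handle overlap,
   tiny sizes, prefetching, CPU feature dispatch, etc. */
static void *streaming_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Align the destination: streaming stores require 16-byte alignment. */
    while (((uintptr_t)d & 15) && n) {
        *d++ = *s++;
        n--;
    }

    /* Unrolled main loop: 64 bytes (four 16-byte vectors) per iteration,
       written with non-temporal stores that bypass the cache. */
    while (n >= 64) {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(s +  0));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i v2 = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i v3 = _mm_loadu_si128((const __m128i *)(s + 48));
        _mm_stream_si128((__m128i *)(d +  0), v0);
        _mm_stream_si128((__m128i *)(d + 16), v1);
        _mm_stream_si128((__m128i *)(d + 32), v2);
        _mm_stream_si128((__m128i *)(d + 48), v3);
        s += 64; d += 64; n -= 64;
    }
    _mm_sfence();  /* make the streaming stores globally visible */

    /* Tail: plain byte copy for the remainder. */
    while (n--)
        *d++ = *s++;
    return dst;
}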

The question is: why can't the hardware manufacturers (Intel, AMD) optimize the specific case of

rep stos

and

rep movs

to be recognized as such and perform the fastest possible fill and copy on their own architecture?
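To make the question concrete, a naive memset/memcpy built directly on these string instructions could look roughly like this (GCC/Clang inline assembly on x86-64; the helper names are invented for illustration, and the ABI guarantees the direction flag is clear on entry):

#include <stddef.h>

/* Illustrative only: byte-granular copy via "rep movsb".
   RDI = destination, RSI = source, RCX = count. */
static void *rep_movsb_copy(void *dst, const void *src, size_t n)
{
    void *d = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (d), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return dst;
}

/* Illustrative only: byte-granular fill via "rep stosb".
   RDI = destination, AL = fill byte, RCX = count. */
static void *rep_stosb_fill(void *dst, int c, size_t n)
{
    void *d = dst;
    __asm__ volatile ("rep stosb"
                      : "+D" (d), "+c" (n)
                      : "a" ((unsigned char)c)
                      : "memory");
    return dst;
}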


Solution

  • One thing I'd like to add to the other answers is that rep movs is not actually slow on all modern processors. For instance,

    Usually, the REP MOVS instruction has a large overhead for choosing and setting up the right method. Therefore, it is not optimal for small blocks of data. For large blocks of data, it may be quite efficient when certain conditions for alignment etc. are met. These conditions depend on the specific CPU (see page 143). On Intel Nehalem and Sandy Bridge processors, this is the fastest method for moving large blocks of data, even if the data are unaligned.

    [Highlighting is mine.] Reference: Agner Fog, Optimizing subroutines in assembly language: An optimization guide for x86 platforms, p. 156 (and see also section 16.10, p. 143) [version of 2011-06-08].
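    To make the quoted trade-off concrete, here is a sketch of the kind of size-based dispatch it implies. The 128-byte threshold is an arbitrary placeholder for illustration, not a figure from the guide, and the function name is made up:

    #include <stddef.h>

    /* Illustrative sketch of size-based dispatch between a simple loop
       and the hardware "fast string" path. */
    static void *dispatching_copy(void *dst, const void *src, size_t n)
    {
        if (n < 128) {
            /* Small block: a plain byte loop avoids the startup cost of
               selecting and configuring the fast-string machinery. */
            unsigned char *d = dst;
            const unsigned char *s = src;
            while (n--)
                *d++ = *s++;
        } else {
            /* Large block: let the CPU's rep movsb implementation choose
               its own internal copy strategy. */
            void *d = dst;
            __asm__ volatile ("rep movsb"
                              : "+D" (d), "+S" (src), "+c" (n)
                              :
                              : "memory");
        }
        return dst;
    }

    The setup cost of the fast-string path only pays off once the block is large enough, which is exactly the distinction the quoted passage draws between small and large blocks.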