Are the restrictions of std::copy more relaxed than those of std::memcpy?

Regarding copy vs. memcpy vs. memmove (excellent info here, btw.), I have been reading up, and it seems to me that, contrary to what is commonly said, for example at cppreference (note: memcpy has been changed to memmove since taking this quote) --

Notes

In practice, implementations of std::copy avoid multiple assignments and use bulk copy functions such as std::memcpy if the value type is TriviallyCopyable

-- neither std::copy nor std::copy_backward can be implemented in terms of memcpy, because for std::copy only the beginning of the destination range must not fall into the source range, whereas for memcpy the two ranges must not overlap at all.
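To make the difference in preconditions concrete, here is a minimal sketch. The helper names drop_front_copy and drop_front_memmove are made up for illustration. Both slide the tail of a vector to the front, so source and destination overlap, yet the start of the destination lies outside the source range, which is all std::copy asks for. The same call expressed through memcpy would violate memcpy's no-overlap requirement, so the second helper uses memmove.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: drop the first n elements by sliding the rest forward.
// Source [begin + n, end) and destination [begin, end - n) may overlap, but
// for n > 0 the destination start is outside the source range, so the
// std::copy precondition holds.
std::vector<int> drop_front_copy(std::vector<int> v, std::size_t n) {
    assert(n <= v.size());
    if (n != 0) {  // n == 0 would put d_first inside [first, last)
        std::copy(v.begin() + n, v.end(), v.begin());
    }
    v.resize(v.size() - n);
    return v;
}

// The same operation on raw bytes. std::memmove permits any overlap;
// std::memcpy here would be undefined behavior whenever the ranges overlap.
std::vector<int> drop_front_memmove(std::vector<int> v, std::size_t n) {
    assert(n <= v.size());
    const std::size_t remaining = v.size() - n;
    if (remaining != 0) {
        std::memmove(v.data(), v.data() + n, remaining * sizeof(int));
    }
    v.resize(remaining);
    return v;
}
```

Calling either helper with {1, 2, 3, 4, 5} and n = 1 yields {2, 3, 4, 5}, even though four of the five source elements overlap the destination.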

Looking at Visual C++'s implementation (see the xutility header), we can also observe that VC++ uses memmove, but memmove in turn has more relaxed requirements than std::copy:

... The objects may overlap: copying takes place as if the characters were copied to a temporary character array and then the characters were copied from the array ...

So it would appear that implementing std::copy in terms of memcpy is not possible, but using memmove is actually a pessimization. (A wee tiny bit of pessimization, possibly not measurable, but still.)
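For reference, the dispatch being discussed can be sketched roughly as follows. This is not the actual VC++ code or any real library's implementation; copy_sketch and the two copy_impl overloads are hypothetical names, and the sketch is restricted to raw pointers to keep it short. Trivially copyable element types go through memmove, everything else through element-wise assignment.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical bulk path: chosen when T is trivially copyable.
// memmove tolerates every overlap that std::copy permits (and more).
template <class T>
T* copy_impl(const T* first, const T* last, T* d_first, std::true_type) {
    const std::size_t count = static_cast<std::size_t>(last - first);
    if (count != 0) {
        std::memmove(d_first, first, count * sizeof(T));
    }
    return d_first + count;
}

// Hypothetical generic path: plain forward element-wise assignment.
template <class T>
T* copy_impl(const T* first, const T* last, T* d_first, std::false_type) {
    for (; first != last; ++first, ++d_first) {
        *d_first = *first;
    }
    return d_first;
}

// Hypothetical entry point: picks a path at compile time via tag dispatch.
template <class T>
T* copy_sketch(const T* first, const T* last, T* d_first) {
    return copy_impl(first, last, d_first, std::is_trivially_copyable<T>());
}
```

Swapping memmove for memcpy in the bulk path would only be valid if the caller guaranteed that the ranges never overlap, which is a stronger promise than the one std::copy requires.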

To come back to the question(s): Is my summary correct? Is this a problem anywhere? Regardless of what's specified, is there even a possible practical implementation of memcpy that would not also fulfill the requirements of std::copy, i.e. are there memcpy implementations that break when the ranges partially overlap as allowed by std::copy?

Solution

If the question is whether it's possible to encounter an efficient memcpy implementation with enough undefined behavior that it cannot be trusted on overlapping ranges, then the answer is yes. :-)

Consider one possible implementation of memcpy on the Power(PC) architecture: the lmw instruction loads multiple consecutive words from memory into consecutive registers (the register range can be specified as an argument). stmw then stores the supplied register range back to memory. Thus, we are talking about roughly 100/200 bytes (32b/64b CPU) buffered by the CPU during a single memcpy iteration: plenty of data to spoil the target range if it overlaps with the source one, especially considering that the CPU makes no promises about the relative order of individual loads and stores.
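The register-buffering behavior above is hard to reproduce portably, so the sketch below swaps in a different, simpler strategy to show the same kind of failure: a hypothetical memcpy-like routine (backward_memcpy, not a real library function) that copies from the last byte to the first. That order is perfectly fine under memcpy's contract, since the ranges are not supposed to overlap, but it corrupts the data for an overlap that std::copy permits.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical memcpy-like routine that copies from the end toward the start.
// Correct whenever the ranges do not overlap, which is all memcpy promises.
void* backward_memcpy(void* dst, const void* src, std::size_t n) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    while (n != 0) {
        --n;
        d[n] = s[n];
    }
    return dst;
}

// Shift the characters one position to the left with std::copy.
// The ranges overlap, but the destination start is outside the source range.
std::string shift_left_with_copy(std::string s) {
    if (s.size() > 1) {
        std::copy(s.begin() + 1, s.end(), s.begin());
    }
    return s;
}

// The same shift routed through the backward-copying routine.
std::string shift_left_with_backward_memcpy(std::string s) {
    if (s.size() > 1) {
        backward_memcpy(&s[0], &s[1], s.size() - 1);
    }
    return s;
}
```

For the input "abcde", shift_left_with_copy returns "bcdee", while shift_left_with_backward_memcpy returns "eeeee": each byte is overwritten before it is read as a source, so the last character smears across the whole string.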