Optimizing a memswap function

In a code review for the following implementation of memswap():

void util_memswap(size_t psize,
                  void *restrict p1, 
                  void *restrict p2)
{
    unsigned char *a = p1;
    unsigned char *b = p2;
    unsigned char tmp;

    while (psize--) {
        tmp = *a;
        *a++ = *b;
        *b++ = tmp;
    }
}

I received the following advice:

The implementation of util_memswap() could be improved for larger objects:
while (psize--) {
    tmp = *a;
    *a++ = *b;
    *b++ = tmp;
} 
You'll find that your library's version of memcpy() and/or memmove() probably shift larger units according to the target platform's architecture, resorting to byte-by-byte copies only at the ends (and possibly for unaligned data). A tuned implementation of swap would likely use similar optimisations.

It's unlikely that your compiler's optimiser would recognise the pattern here and group the transfers into larger units. I tried with latest GCC and Clang - the former did some loop unrolling, but both still generate code that accesses memory in single units as far I can read. @Toby Speight

And then searched around and found sample implementations of memcpy() that copy words instead of bytes after reinterpreting the void * as a long *. But to me, that seems like it'd break strict aliasing rules, as the void * may not originally have been a long *. Assuming my understanding is correct, what option do I have to optimize the above routine?

About replacing it with 3 memcpy() calls, @Peter Cordes says in the comments below this answer:

This will probably defeat some optimizations in most compilers. Many are pretty good at seeing through this kind of use of memcpy, but definitely not perfect. I would highly recommend not using this if you care about performance. At best this might make an #ifdef fallback for compilers that don't support GNU extensions, specifically typeof().

Edit: The above routine is being used like so:

#include <assert.h>
#include <string.h>

/**
 * Like C11's _Static_assert() except that it can be used in an expression.
 *
 * EXPR - The expression to check.
 * MSG  - The string literal of the error message to print only if EXPR evalutes
 *        to false.
 *
 * Always return true. */
#define STATIC_ASSERT_EXPR(EXPR, MSG)   \
    (!!sizeof( struct { static_assert ( (EXPR), MSG ); char c; } ))

/* OP is aware that this would not work for VLAs, or variables that are 
 * declared with the `register` storage class, or identical variables. */
#define SWAP(A, B)                                                         \
    util_memswap((sizeof *(1 ? &(A) : &(B))                                \
        * STATIC_ASSERT_EXPR(sizeof (A) == sizeof (B),                     \
        "Arguments of SWAP() must have same size and compatible types.")), \
        &(A),                                                              \
        &(B))

Solution

Unless you are using not very optimizing compiler, stick with the simple unsigned char implementation, and let the compiler perform optimizations.

You have linked an old answer with an old comment, remember that such advice sometimes get old, as compilers improve.

Peter is right, the compilers are indeed optimizing this case. I tried gcc, Clang, and MSVC on x86-64, only MSVC failed to optimize: https://godbolt.org/z/E5f4acze9 . Note that vmovups with ymm argument uses 256-bit registers, which is a bigger piece than 32-bit or 64-bit long. Also observe that there are no stack spills.
Peter is partly right that attempt to memcpy a larger type will inhibit a better optimization. See https://godbolt.org/z/GMfdPfsYG -- gcc switched to 64-bit registers (became worse), Clang is still using 256-bit registers.
If you make the element size constant, then it starts fully optimizing in gcc too https://godbolt.org/z/jqGcvE6YE (and also starts compiling in MSVC, which doesn't support VLA, but with only 64-bit registers)
You may also want to increase element size constant to vector register size. Now all three mentioned compilers fully optimize it: https://godbolt.org/z/7ooaE7oYE

But with larger element size you may also have to handle tails, if your size may be not a whole number of elements. Also you'll need to know vector register size. So, unless you are using not very optimizing compiler, stick with the simple unsigned char implementation, and let the compiler perform optimizations.

MSVC generally isn't strong in compiler optimization, and also it doesn't target C in particular, as opposed to C++. For C++ it has std::swap_ranges algorithm optimization implemented in the STL (not in the compiler).