Fastest way to copy a blittable struct to an unmanaged memory location (IntPtr)

I have a function similar to the following:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void SetVariable<T>(T newValue) where T : struct {
    // I know by this point that T is blittable (i.e. only unmanaged value types)

    // varPtr is a void*, and is where I want to copy newValue to
    *varPtr = newValue; // This won't work, but is basically what I want to do
}

I saw Marshal.StructureToPtr(), but it seems quite slow, and this is performance-sensitive code. If I knew the type T I could just declare varPtr as a T*, but... well, I don't.

Either way, I'm after the fastest possible way to do this. 'Safety' is not a concern: by this point in the code, I know that the struct T fits exactly into the memory pointed to by varPtr.

Solution

One answer is to reimplement memcpy in C#, using the same optimization tricks that native memcpy implementations use. Microsoft does this in their own code; see the Buffer.cs file in the Microsoft Reference Source:

     // This is tricky to get right AND fast, so lets make it useful for the whole Fx.
     // E.g. System.Runtime.WindowsRuntime!WindowsRuntimeBufferExtensions.MemCopy uses it.
     internal unsafe static void Memcpy(byte* dest, byte* src, int len) {

        // This is portable version of memcpy. It mirrors what the hand optimized assembly versions of memcpy typically do.
        // Ideally, we would just use the cpblk IL instruction here. Unfortunately, cpblk IL instruction is not as efficient as
        // possible yet and so we have this implementation here for now.

        switch (len)
        {
        case 0:
            return;
        case 1:
            *dest = *src;
            return;
        case 2:
            *(short *)dest = *(short *)src;
            return;
        case 3:
            *(short *)dest = *(short *)src;
            *(dest + 2) = *(src + 2);
            return;
        case 4:
            *(int *)dest = *(int *)src;
            return;
        ...

It's interesting to note that they handle every length below 512 in managed code; most sizes rely on pointer casts (reading and writing through short*, int*, and so on) to get the JIT to emit loads and stores of different widths. Only at 512 bytes and above do they finally drop into the native memcpy:

        // P/Invoke into the native version for large lengths
        if (len >= 512)
        {
            _Memcpy(dest, src, len);
            return;
        }

Presumably, native memcpy is faster for large copies, since it can be hand-optimized to use SSE/MMX instructions.
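
To tie this back to the question, here is a minimal sketch of the same pointer-cast idea applied to SetVariable. It assumes C# 7.3 or later (for the `unmanaged` constraint, which is not in the original question), compilation with /unsafe, and a platform that tolerates unaligned stores. BlitSketch and CopySmall are hypothetical names, and varPtr is passed as a parameter here rather than read from a field. It is an illustration of the technique, not Microsoft's implementation.

```csharp
using System;
using System.Runtime.CompilerServices;

public static unsafe class BlitSketch
{
    // Hypothetical helper: copies len bytes, using a single widened
    // load/store for a few common small sizes and falling back to
    // Buffer.MemoryCopy for everything else.
    public static void CopySmall(byte* dest, byte* src, int len)
    {
        switch (len)
        {
            case 0:
                return;
            case 1:
                *dest = *src;
                return;
            case 2:
                *(short*)dest = *(short*)src;
                return;
            case 4:
                *(int*)dest = *(int*)src;
                return;
            case 8:
                *(long*)dest = *(long*)src;
                return;
            default:
                // Arguments: source, destination, destination size, bytes to copy.
                Buffer.MemoryCopy(src, dest, len, len);
                return;
        }
    }

    // Assumption: T is constrained to 'unmanaged' so that sizeof(T) and
    // &newValue are legal for a generic type parameter.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void SetVariable<T>(void* varPtr, T newValue) where T : unmanaged
    {
        CopySmall((byte*)varPtr, (byte*)&newValue, sizeof(T));
    }
}
```

With the `unmanaged` constraint in place, the direct store `*(T*)varPtr = newValue;` also compiles, which is essentially the line the question wanted to write. Whether the explicit size switch gains anything over that would need to be measured for the struct sizes you actually use.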