Why not use GDI to repeatedly fill a window with RGB data from an array?

This is a follow-up to this question. I'm currently writing a simple game and am looking for the fastest way to (repeatedly) display an array of RGB data in a Win32 window, without flickering or other artifacts.

Several different approaches were recommended in the answers to the previous question, but there was no consensus on which would be the fastest. So, I threw together a test program. The code simply displays a framebuffer on the screen repeatedly, as fast as possible.

These are the results I obtained, for 32-bit data running in a 32-bit video mode - they may surprise some people:

- Direct3D (1):             500 fps
- Direct3D (2):             650 fps
- DirectDraw (3):          1100 fps
- DirectDraw (4):           800 fps
- GDI (SetDIBitsToDevice): 2000 fps

Given these figures:

Why are many people adamant that GDI is simply too slow for this operation?
Is there any reason to prefer DirectDraw or Direct3D over SetDIBitsToDevice?

Here is a brief summary of the calls made by each of the Direct* codepaths. If anyone knows a more efficient way to use DirectDraw/Direct3D, please comment.

1. CreateTexture(D3DUSAGE_DYNAMIC, D3DPOOL_DEFAULT);
       LockRect(); memcpy(); UnlockRect(); DrawPrimitive()

2. CreateTexture(0, D3DPOOL_SYSTEMMEM); CreateTexture(0, D3DPOOL_DEFAULT);
       LockRect(); memcpy(); UnlockRect(); UpdateTexture(); DrawPrimitive()

3. CreateSurface(); SetSurfaceDesc(lpSurface = &frameBuffer[0]);
       memcpy(); primarySurface->Blt();

4. CreateSurface();
       Lock(); memcpy(); Unlock(); primarySurface->Blt();

Solution

There are a couple of things to keep in mind here. First of all, a lot of "common knowledge" is based on some facts that no longer really apply.

In the days of AGP, when the CPU talked directly to the GPU, it always used the base PCI protocol, which happened at the "1x" rate (always and inevitably). AGX 2x/4x/8x only applied when the GPU was taking to the memory controller directly. In other words, depending on when you looked, it was up to 8 times as fast to have the GPU load a texture from memory as it was for the CPU to send the same data directly to the GPU. Of course, the CPU also had a great deal more bandwidth to memory than the PCI bus supported.

When things switched to PCI-E, however, that changed completely. While there can be differences in bandwidth depending on path, there's no general rule that memory->GPU will be faster than CPU->GPU. The one generalization that's (mostly) safe is that if you have a dedicated graphics card, then the GPU will almost always have more bandwidth to the memory on the graphics card than it does to main memory on the motherboard.

In your case, that doesn't matter much though -- you're talking about moving data from CPU space to GPU space regardless. The main speed difference with using DirectX (or OpenGL) happens when you keep all (or most) of the computation on the GPU, and avoid using the CPU (or main memory) at all. They don't (now that AGP is history) provide any substantial improvement in memory->display bandwidth.