GDI+ DrawImage notably slower in C++ (Win32) than in C# (WinForms)

I am porting an application from C# (WinForms) to C++ and noticed that drawing an image using GDI+ is much slower in C++, even though it uses the same API.

The image is loaded at application startup into a System.Drawing.Image or Gdiplus::Image, respectively.

The C# drawing code is (directly in the main form):

public Form1()
{
    this.SetStyle(ControlStyles.UserPaint | ControlStyles.AllPaintingInWmPaint | ControlStyles.OptimizedDoubleBuffer, true);
    this.image = Image.FromFile(...);
}

private readonly Image image;

protected override void OnPaint(PaintEventArgs e)
{
    base.OnPaint(e);
    var sw = Stopwatch.StartNew();
    e.Graphics.TranslateTransform(this.translation.X, this.translation.Y); /* NOTE0 */
    e.Graphics.DrawImage(this.image, 0, 0, this.image.Width, this.image.Height);
    Debug.WriteLine(sw.Elapsed.TotalMilliseconds.ToString()); // ~3ms
}

Regarding SetStyle: AFAIK, these flags (1) make WndProc ignore WM_ERASEBKGND, and (2) allocate a temporary HDC and Graphics for double buffered drawing.

The C++ drawing code is more bloated. I have browsed the reference source of System.Windows.Forms.Control to see how it handles HDC and how it implements double buffering.

As far as I can tell, my implementation matches it closely (see NOTE1). Note that I wrote the C++ version first and only then looked at the .NET source, so I may have overlooked things. The rest of the program is more or less what you get when you create a fresh Win32 project in VS2019. All error handling is omitted for readability.

// In wWinMain:
    Gdiplus::GdiplusStartupInput gdiplusStartupInput;
    Gdiplus::GdiplusStartup(&gdiplusToken, &gdiplusStartupInput, NULL);
    gdip_bitmap = Gdiplus::Image::FromFile(...);

// In the WndProc callback:
case WM_PAINT:
    // Need this for the back buffer bitmap
    RECT client_rect;
    GetClientRect(hWnd, &client_rect);
    int client_width = client_rect.right - client_rect.left;
    int client_height = client_rect.bottom - client_rect.top;

    // Double buffering
    HDC hdc0 = BeginPaint(hWnd, &ps);
    HDC hdc = CreateCompatibleDC(hdc0);
    HBITMAP back_buffer = CreateCompatibleBitmap(hdc0, client_width, client_height); /* NOTE1 */
    HBITMAP dummy_buffer = (HBITMAP)SelectObject(hdc, back_buffer);

    // Create GDI+ stuff on top of HDC
    Gdiplus::Graphics *graphics = Gdiplus::Graphics::FromHDC(hdc);

    QueryPerformanceCounter(...);
    graphics->DrawImage(gdip_bitmap, 0, 0, bitmap_width, bitmap_height);
    /* print performance counter diff */ // -> ~27 ms typically

    delete graphics;

    // Double buffering
    BitBlt(hdc0, 0, 0, client_width, client_height, hdc, 0, 0, SRCCOPY);
    SelectObject(hdc, dummy_buffer);
    DeleteObject(back_buffer);
    DeleteDC(hdc); // This is the temporary double buffer HDC

    EndPaint(hWnd, &ps);

/* NOTE1 */: The .NET source code does not use CreateCompatibleBitmap here, but CreateDIBSection. Doing the same improves performance from 27 ms to 21 ms, but it is very cumbersome (see below).

In both cases I am calling Control.Invalidate or InvalidateRect, respectively, when the mouse moves (OnMouseMove, WM_MOUSEMOVE). The goal is to implement panning with the mouse using SetTransform - that's irrelevant for now as long as draw performance is bad.

NOTE2: https://stackoverflow.com/a/1617930/653473

This answer suggests that Gdiplus::CachedBitmap is the trick. However, I can find no evidence in the C# WinForms source code that it uses cached bitmaps in any way. The C# Graphics.DrawImage overload calls GdipDrawImageRectI, which is the same flat API function that Graphics::DrawImage(IN Image* image, IN INT x, IN INT y, IN INT width, IN INT height) maps to in C++.

Regarding /* NOTE1 */, here is the replacement for CreateCompatibleBitmap (just substitute CreateVeryCompatibleBitmap):

bool bFillBitmapInfo(HDC hdc, BITMAPINFO *pbmi)
{
    HBITMAP hbm = NULL;
    bool bRet = false;

    // Create a dummy bitmap from which we can query color format info about the device surface.
    hbm = CreateCompatibleBitmap(hdc, 1, 1);

    pbmi->bmiHeader.biSize = sizeof(BITMAPINFOHEADER);

    // Call first time to fill in BITMAPINFO header.
    GetDIBits(hdc, hbm, 0, 0, NULL, pbmi, DIB_RGB_COLORS);

    if ( pbmi->bmiHeader.biBitCount <= 8 ) {
        // UNSUPPORTED
    } else {
        if ( pbmi->bmiHeader.biCompression == BI_BITFIELDS ) {
            // Call a second time to get the color masks.
            // It's a GetDIBits Win32 "feature".
            GetDIBits(hdc, hbm, 0, pbmi->bmiHeader.biHeight, NULL, pbmi, DIB_RGB_COLORS);
        }
        bRet = true;
    }

    if (hbm != NULL) {
        DeleteObject(hbm);
        hbm = NULL;
    }
    return bRet;
}

HBITMAP CreateVeryCompatibleBitmap(HDC hdc, int width, int height)
{
    BITMAPINFO *pbmi = (BITMAPINFO *)LocalAlloc(LMEM_ZEROINIT, 4096); // Because otherwise I would have to figure out the actual size of the color table at the end; whatever...
    bFillBitmapInfo(hdc, pbmi);
    pbmi->bmiHeader.biWidth = width;
    pbmi->bmiHeader.biHeight = height;
    if (pbmi->bmiHeader.biCompression == BI_RGB) {
        pbmi->bmiHeader.biSizeImage = 0;
    } else {
        if ( pbmi->bmiHeader.biBitCount == 16 )
            pbmi->bmiHeader.biSizeImage = width * height * 2;
        else if ( pbmi->bmiHeader.biBitCount == 32 )
            pbmi->bmiHeader.biSizeImage = width * height * 4;
        else
            pbmi->bmiHeader.biSizeImage = 0;
    }
    pbmi->bmiHeader.biClrUsed = 0;
    pbmi->bmiHeader.biClrImportant = 0;

    void *dummy;
    HBITMAP back_buffer = CreateDIBSection(hdc, pbmi, DIB_RGB_COLORS, &dummy, NULL, 0);
    LocalFree(pbmi);
    return back_buffer;
}

Using a very compatible bitmap as the back buffer improves performance from 27 ms to 21 ms.
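A side note on the biSizeImage computation above: DIB scanlines are padded to a multiple of 4 bytes, so width * height * 2 is only exact for 16 bpp when the width is even (32 bpp rows are always aligned). The following is a minimal, platform-independent sketch of the usual stride formula. The helper names dib_stride_bytes and dib_image_size_bytes are made up for illustration and are not part of the Win32 API, and this is unlikely to be related to the performance difference.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per scanline of an uncompressed DIB. Rows are padded to a
// multiple of 4 bytes (one DWORD). Hypothetical helper name.
inline std::uint32_t dib_stride_bytes(std::uint32_t width,
                                      std::uint32_t bits_per_pixel)
{
    return ((width * bits_per_pixel + 31u) / 32u) * 4u;
}

// Total pixel-data size in bytes for an uncompressed DIB.
// Hypothetical helper name; a candidate value for biSizeImage.
inline std::uint32_t dib_image_size_bytes(std::uint32_t width,
                                          std::uint32_t height,
                                          std::uint32_t bits_per_pixel)
{
    assert(width > 0 && height > 0);
    return dib_stride_bytes(width, bits_per_pixel) * height;
}
```

For a 3 x 2 bitmap at 16 bpp this gives 16 bytes, whereas width * height * 2 would give 12.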

Regarding /* NOTE0 */ in the C# code -- the code is only fast if the transformation matrix doesn't scale. C# performance drops slightly when upscaling (~9ms), and drops significantly (~22ms) when downsampling.

This suggests that DrawImage tries to take a fast path (something like a BitBlt) whenever it can, and that it cannot in my C++ case, presumably because the format of the bitmap loaded from disk differs from the back buffer format. If I create a new, more compatible bitmap (here there is no clear difference between CreateCompatibleBitmap and CreateVeryCompatibleBitmap), draw the original bitmap onto it once, and from then on pass only the more compatible bitmap to DrawImage, the draw time drops to about 4.5 ms. It then also shows the same performance characteristics under scaling as the C# code.

if (better_bitmap == NULL)
{
    HBITMAP tmp_bitmap = CreateVeryCompatibleBitmap(hdc0, gdip_bitmap->GetWidth(), gdip_bitmap->GetHeight());
    HDC copy_hdc = CreateCompatibleDC(hdc0);
    HGDIOBJ old = SelectObject(copy_hdc, tmp_bitmap);
    Gdiplus::Graphics *copy_graphics = Gdiplus::Graphics::FromHDC(copy_hdc);
    copy_graphics->DrawImage(gdip_bitmap, 0, 0, gdip_bitmap->GetWidth(), gdip_bitmap->GetHeight());
    // Now tmp_bitmap contains the image, hopefully in the device's preferred format
    delete copy_graphics;
    SelectObject(copy_hdc, old);
    DeleteDC(copy_hdc);
    better_bitmap = Gdiplus::Bitmap::FromHBITMAP(tmp_bitmap, NULL);
}

But it is still consistently slower than the C# version, so something must still be missing. It also raises a new question: why is this not necessary in C# (same image, same machine)? As far as I can tell, Image.FromFile does not convert the bitmap format on loading.

Why is the DrawImage call in the C++ code still slower, and what do I need to do to make it as fast as in C#?

Solution

I ended up replicating more of the .NET code insanity.

The magic call that makes it go fast is GdipImageForceValidation in System.Drawing.Image.FromFile. This function is basically not documented at all, and it is not even [officially] callable from C++. It is merely mentioned here: https://learn.microsoft.com/en-us/windows/win32/gdiplus/-gdiplus-image-flat

Gdiplus::Image::FromFile and GdipLoadImageFromFile don't actually load the full image into memory. The image effectively gets read from disk again every time it is drawn. GdipImageForceValidation forces the image to be loaded into memory, or so it seems...
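To make the difference concrete, here is a toy model in plain C++ (no GDI+ involved). It only illustrates the cost pattern I believe I am seeing: an image that goes back to its source on every draw versus one that is decoded once up front. ToySource, LazyImage and ValidatedImage are made-up names, and this is not a description of what GDI+ really does internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for an expensive-to-decode image source (hypothetical type).
struct ToySource {
    std::size_t pixel_count;
    int decode_calls;

    explicit ToySource(std::size_t n) : pixel_count(n), decode_calls(0) {}

    // Pretends to decode the whole image and counts how often that happens.
    std::vector<std::uint32_t> decode()
    {
        ++decode_calls;
        return std::vector<std::uint32_t>(pixel_count, 0xFF336699u);
    }
};

// Lazy image: goes back to the source on every draw.
struct LazyImage {
    ToySource* source;

    std::uint32_t draw()
    {
        std::vector<std::uint32_t> px = source->decode();
        return px.empty() ? 0u : px[0];
    }
};

// "Validated" image: decodes once in the constructor, then draws from memory.
struct ValidatedImage {
    std::vector<std::uint32_t> pixels;

    explicit ValidatedImage(ToySource& source) : pixels(source.decode()) {}

    std::uint32_t draw() const { return pixels.empty() ? 0u : pixels[0]; }
};
```

Drawing a LazyImage three times decodes three times, while a ValidatedImage decodes exactly once no matter how often it is drawn.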

My initial idea of copying the image into a more compatible bitmap was on the right track, but the way I did it does not yield the best performance for GDI+ (because I used a GDI bitmap from the original HDC). Loading the image directly into a new GDI+ bitmap, regardless of pixel format, yields the same performance characteristics as seen in the C# implementation:

better_bitmap = new Gdiplus::Bitmap(gdip_bitmap->GetWidth(), gdip_bitmap->GetHeight(), PixelFormat24bppRGB);
Gdiplus::Graphics *graphics = Gdiplus::Graphics::FromImage(better_bitmap);
graphics->DrawImage(gdip_bitmap, 0, 0, gdip_bitmap->GetWidth(), gdip_bitmap->GetHeight());
delete graphics;

Better yet, using PixelFormat32bppPARGB improves performance substantially further. The premultiplied alpha pays off when the image is drawn repeatedly (regardless of whether the source image has an alpha channel).
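As a rough illustration of what premultiplied alpha saves, here is a small plain C++ sketch of source-over blending for a single 8-bit channel. It is not GDI+'s actual code, and the function names are made up. It only shows that premultiplying moves one multiplication per channel out of the draw path and into a one-time conversion. It does not by itself explain the gain for images without an alpha channel, which I assume comes from avoiding a per-draw format conversion.

```cpp
#include <cstdint>

// Rounded x * y / 255 for 8-bit inputs.
inline std::uint8_t mul_div_255(unsigned x, unsigned y)
{
    return static_cast<std::uint8_t>((x * y + 127u) / 255u);
}

// One-time conversion of a straight-alpha channel value to premultiplied.
inline std::uint8_t premultiply(std::uint8_t src, std::uint8_t alpha)
{
    return mul_div_255(src, alpha);
}

// Straight alpha: two multiplications per channel on every draw.
inline std::uint8_t blend_straight(std::uint8_t src, std::uint8_t alpha,
                                   std::uint8_t dst)
{
    return static_cast<std::uint8_t>(mul_div_255(src, alpha) +
                                     mul_div_255(dst, 255u - alpha));
}

// Premultiplied alpha: the source term is already scaled, so only the
// destination needs a multiplication on every draw.
inline std::uint8_t blend_premultiplied(std::uint8_t src_premul,
                                        std::uint8_t alpha, std::uint8_t dst)
{
    return static_cast<std::uint8_t>(src_premul +
                                     mul_div_255(dst, 255u - alpha));
}
```

With these definitions, blend_premultiplied(premultiply(s, a), a, d) gives the same result as blend_straight(s, a, d) for all 8-bit inputs.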

It seems calling GdipImageForceValidation effectively does something similar internally, although I don't know what it really does. Because Microsoft made it as impossible as they could to call the GDI+ flat API from C++ user code, I just modified Gdiplus::Image in my Windows SDK headers to include an appropriate method. Copying the bitmap explicitly to PARGB seems cleaner to me (and yields better performance).

Of course, once you know which undocumented function to look for, Google also turns up some additional information: https://photosauce.net/blog/post/image-scaling-with-gdi-part-5-push-vs-pull-and-image-validation

GDI+ is not my favorite API.