How are GDI+ functions so fast?

I am trying to recreate very simple GDI+ functions, such as scaling and rotating an image. The reason is that some GDI+ functions can't be called from multiple threads (I found a workaround using processes but didn't want to get into that), and processing thousands of images on one thread was far too slow. Also, my images are grayscale, so a custom function would only have to handle one value per pixel instead of four.

No matter what kind of function I try to recreate, even when highly optimized, it is always several times slower, despite being greatly simplified compared to what GDI+ is doing (I am operating on a 1D array of bytes, one byte per pixel).

I thought the way I was rotating each point might be the difference, so I took the rotation out completely. That left a function that goes through each pixel and just sets it to the value it already has. Even that was only roughly tied with GDI+ for speed, although GDI+ was doing an actual rotation and changing four values per pixel.

What makes this possible? Is there a way to match it using your own function?

Solution

GDI+ is native code written in C/C++, possibly with parts in assembly. Some GDI+ calls may also use GDI, an older and well-optimized API. You will find it difficult to match that performance, even if you know all the pixel manipulation tricks.
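
To give an idea of what such pixel manipulation tricks look like, here is a minimal sketch in C++ of a nearest-neighbour rotation for a one-byte-per-pixel image stored in a 1D array. It is not how GDI+ does it; the function name `rotateGray`, its parameters, and the sign convention of the angle are all assumptions made for illustration. The idea is to walk the destination row by row, compute the matching source coordinate once per row, and then step it by a constant amount per pixel, so the inner loop has no trigonometry and no per-pixel multiplication.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: rotate an 8-bit grayscale image (row-major, one byte
// per pixel) about its centre using nearest-neighbour sampling. Destination
// pixels whose source falls outside the image are set to `fill`.
std::vector<std::uint8_t> rotateGray(const std::vector<std::uint8_t>& src,
                                     int width, int height,
                                     double radians, std::uint8_t fill = 0)
{
    std::vector<std::uint8_t> dst(src.size(), fill);
    const double c = std::cos(radians);   // computed once, not per pixel
    const double s = std::sin(radians);
    const double cx = (width - 1) / 2.0;  // centre of rotation
    const double cy = (height - 1) / 2.0;

    for (int y = 0; y < height; ++y) {
        // Source coordinate that maps to destination pixel (0, y).
        double sx = c * (0 - cx) + s * (y - cy) + cx;
        double sy = -s * (0 - cx) + c * (y - cy) + cy;
        std::uint8_t* out = dst.data() + static_cast<std::size_t>(y) * width;

        for (int x = 0; x < width; ++x) {
            // Round to the nearest source pixel.
            const int ix = static_cast<int>(std::floor(sx + 0.5));
            const int iy = static_cast<int>(std::floor(sy + 0.5));
            if (ix >= 0 && ix < width && iy >= 0 && iy < height) {
                out[x] = src[static_cast<std::size_t>(iy) * width + ix];
            }
            // Moving one pixel right in the destination moves the source
            // coordinate by a constant step, so only additions are needed.
            sx += c;
            sy -= s;
        }
    }
    return dst;
}
```

Because each destination row is written independently, the outer loop could in principle be split across threads. Whether this gets close to GDI+ depends on the compiler, the optimization settings, and the image sizes, so it is worth measuring rather than assuming.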