Tags: c++, opengl, opencl, glm-math

Speeding up transform calculations


I am programming an OpenGL 3 2D engine and am currently trying to track down a bottleneck. Please see the following output of the AMD profiler: http://h7.abload.de/img/profilerausa.png

The data was captured using several thousand sprites. At 50,000 sprites the test app is already unusable, running at 5 fps.

This shows that my bottleneck is the transform function I use. Here is the corresponding function: http://code.google.com/p/nightlight2d/source/browse/NightLightDLL/NLBoundingBox.cpp#130

void NLBoundingBox::applyTransform(NLVertexData* vertices)
{
    if ( needsTransform() )
    {
        // Apply Matrix
        for ( int i = 0; i < 6; i++ )
        {
            glm::vec4 transformed = m_rotation * m_translation * glm::vec4(vertices[i].x, vertices[i].y, 0.0f, 1.0f);
            vertices[i].x = transformed.x;
            vertices[i].y = transformed.y;
        }
        m_translation    = glm::mat4(1);
        m_rotation       = glm::mat4(1);
        m_needsTransform = false;
    }
}

I can't do the transform in a shader, because I am batching all sprites into a single draw call. That means I have to calculate the transforms on the CPU.

My question is: what is the best way to eliminate this bottleneck?

I don't use any threads at the moment, so with vsync enabled I take an extra performance hit as well, because the main thread blocks waiting for the display to finish. That suggests I should use threading.

Another way to go would perhaps be OpenCL? I want to avoid CUDA because, as far as I know, it only runs on NVIDIA cards. Is that right?

Post scriptum:

You can download a demo here if you like:

http://www63.zippyshare.com/v/45025690/file.html

Please note that this requires VC++ 2008 to be installed, because it is a debug build made for running a profiler.


Solution

  • The first thing I would do is concatenate your rotation and translation matrices into one matrix before you enter the for-loop, so that you aren't performing two matrix multiplications for every vertex; instead you multiply the two matrices once and then apply only a single matrix to each vector (see the sketch below). Secondly, you may want to look into unrolling the loop and compiling at a higher optimization level (with g++ I would use at least -O2; I'm not familiar with MSVC, so you'll have to translate that optimization level yourself). That avoids the overhead the loop branches might incur, especially pipeline flushes on mispredicted branches. Lastly, if you haven't already looked into it, check into doing some SSE optimizations, since you're dealing with vectors.
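
    A minimal sketch of that first change, reusing the types and member names from the question's applyTransform; hoisting the concatenation out of the loop is the only difference:

    void NLBoundingBox::applyTransform(NLVertexData* vertices)
    {
        if ( needsTransform() )
        {
            // Concatenate once: one mat4*mat4 multiply here instead of
            // two mat4 multiplications repeated for every vertex.
            const glm::mat4 transform = m_rotation * m_translation;

            for ( int i = 0; i < 6; i++ )
            {
                glm::vec4 transformed = transform * glm::vec4(vertices[i].x, vertices[i].y, 0.0f, 1.0f);
                vertices[i].x = transformed.x;
                vertices[i].y = transformed.y;
            }
            m_translation    = glm::mat4(1);
            m_rotation       = glm::mat4(1);
            m_needsTransform = false;
        }
    }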

    UPDATE: I'm going to add one last idea involving threading: pipeline your vertices. For instance, suppose you have a machine with eight available hardware threads (i.e., a quad-core with hyper-threading). Set up six threads as a vertex-processing pipeline, and use lock-free single-producer/single-consumer queues to pass messages between the stages; each stage transforms one member of your six-member vertex array. I'm guessing there are a great many of these six-member vertex arrays, so if you stream them through the pipeline you can process them very efficiently and avoid mutexes and other locking primitives entirely. For more info on a fast lock-free single-producer/consumer queue, see my answer here; a sketch of such a queue follows.
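
    A minimal sketch of such a queue, written here with C++11 atomics rather than whatever the linked answer uses; the class and member names are illustrative, not taken from the engine:

    #include <atomic>
    #include <cstddef>

    template <typename T, size_t Capacity>
    class SpscQueue {
    public:
        bool push(const T& item)            // called from the producer thread only
        {
            const size_t head = m_head.load(std::memory_order_relaxed);
            const size_t next = (head + 1) % Capacity;
            if (next == m_tail.load(std::memory_order_acquire))
                return false;               // queue full
            m_buffer[head] = item;
            m_head.store(next, std::memory_order_release);
            return true;
        }

        bool pop(T& item)                   // called from the consumer thread only
        {
            const size_t tail = m_tail.load(std::memory_order_relaxed);
            if (tail == m_head.load(std::memory_order_acquire))
                return false;               // queue empty
            item = m_buffer[tail];
            m_tail.store((tail + 1) % Capacity, std::memory_order_release);
            return true;
        }

    private:
        T m_buffer[Capacity];
        std::atomic<size_t> m_head{0};      // next slot the producer writes
        std::atomic<size_t> m_tail{0};      // next slot the consumer reads
    };

    The single-producer/single-consumer restriction is what makes this lock-free: only the producer ever writes m_head and only the consumer ever writes m_tail, so no mutex is needed.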

    UPDATE 2: You only have a dual-core processor, so drop the pipeline idea; it is going to bottleneck as the six pipeline threads contend for two cores.