I've been wondering about this for a while: how do you find the bottleneck of the graphics pipeline? Recently I've been using a program that draws a massive number of polygons in a simple scene with alpha blending (a grass scene). I have two versions of it: one uses static coordinates, and the other applies a rotation and a translation to each blade of grass. Each runs at 60 FPS on its own, with no other heavy processes running. But when I run them together (two windows, each with the same amount of grass at the same positions), the one that uses translation and rotation drops to 10 FPS while the other stays at about 55 FPS. My question is: why do both run at 60 FPS on their own, but when run together the second one (rotation and translation of each blade) drops by about 50 FPS while the first one stays at 55? It sounds like a bottleneck to me. Please let me know if you have any idea. More generally, do you know of any advice or papers on finding the bottleneck on the GPU (or in GPGPU code), or on optimizing graphics code to run on the GPU?
Your problem is actually not a bottleneck on the GPU, nor in your program, but in the driver. glRotate and glTranslate cause a lot of context switches into driver mode, which eat up performance. You're literally spending all the time on bookkeeping instead of doing productive work.
Instancing was introduced to alleviate exactly the problem you encountered.
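To make the idea concrete, here is a minimal CPU-side sketch of what instancing changes. Instead of calling glTranslate/glRotate once per blade, the per-blade position and angle are packed into a single buffer that is uploaded once and read as a per-instance attribute (in real OpenGL, typically via glVertexAttribDivisor and a single glDrawArraysInstanced call). The function names, the (x, y, z, angle) layout, the rotation about the Y axis, and the 100 x 100 grid are all assumptions for illustration, and no GL calls are made here.

```python
import math
import struct

def build_instance_buffer(instances):
    """Pack (x, y, z, angle_radians) per grass blade into one byte string.

    In a real renderer this buffer would be uploaded once and read as a
    per-instance vertex attribute, replacing per-blade matrix calls.
    """
    data = bytearray()
    for x, y, z, angle in instances:
        data += struct.pack("<4f", x, y, z, angle)
    return bytes(data)

def transform_vertex(vertex, instance):
    """Emulate the vertex shader: rotate about the Y axis, then translate."""
    vx, vy, vz = vertex
    x, y, z, angle = instance
    c, s = math.cos(angle), math.sin(angle)
    return (c * vx + s * vz + x, vy + y, -s * vx + c * vz + z)

# Hypothetical scene: a 100 x 100 grid of blades, each with its own rotation.
instances = [
    (float(i), 0.0, float(j), 0.1 * (i + j))
    for i in range(100)
    for j in range(100)
]
instance_buffer = build_instance_buffer(instances)
```

The point of the sketch is the call count: all 10,000 transforms live in one buffer, so the driver is entered once per draw rather than a couple of times per blade, and the per-blade math moves into the vertex shader.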
As for how to profile the graphics pipeline, there are a number of tools to help you there:
gDEBugger http://www.gremedy.com/
NVPerfkit http://developer.nvidia.com/nvidia-perfkit
GPU Perf Studio http://developer.amd.com/tools/PerfStudio/Pages/default.aspx
It also helps to collect some statistics in your program, mostly about the order and number of expensive calls (mainly shader and texture switches).
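A rough sketch of such statistics gathering follows. It assumes you route your state changes through a small wrapper of your own; the StateChangeStats class, the "shader"/"texture" labels, and the integer handles are hypothetical and not part of any GL API. It records the order of calls, how many there were per kind, and how many were redundant (re-binding what is already bound).

```python
from collections import Counter

class StateChangeStats:
    """Counts state-change calls and flags redundant ones."""

    def __init__(self):
        self.calls = Counter()      # total calls per kind
        self.redundant = Counter()  # calls that re-bound the current object
        self.order = []             # full call sequence, for later inspection
        self._current = {}          # currently bound handle per kind

    def record(self, kind, handle):
        self.calls[kind] += 1
        if self._current.get(kind) == handle:
            self.redundant[kind] += 1
        self._current[kind] = handle
        self.order.append((kind, handle))

# Hypothetical frame: four draws, each binding a shader and a texture.
stats = StateChangeStats()
for shader, texture in [(1, 7), (1, 7), (2, 7), (1, 8)]:
    stats.record("shader", shader)
    stats.record("texture", texture)
```

In a real program you would call record() right next to the actual bind call and dump the counters once per frame. A high redundant count suggests sorting draws by shader and texture could be a cheap win.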