Hi everybody.
I have a few questions:
Thanks in advance for your replies.
The answer is: it depends. The graphics hardware tries to keep the number of times a vertex is shaded to a minimum, and how well it succeeds depends on buffering and batching. This is all explained here: http://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/

There are multiple cache lines of vertex buffer tags (here, the tags are the indices pulled from the index buffer). The IA unit pulls indices from the index buffer and fills a cache line. When the line is full (modulo primitive size), it is sent to the card's scheduler hardware, which dispatches it to a block of shading cores. The IA stage then keeps filling a new cache line in parallel with the cores working on the previous request. It does not wait until the index buffer is fully depleted or until all core units are busy. When the results come back, the shaded vertex data is written to a piece of memory that primitive assembly references later.
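To make the batching idea concrete, here is a rough Python sketch. It is a toy model and not how any particular GPU works: the function name `count_shaded_batched`, the default `batch_size` of 32, and the rule "flush when the next primitive no longer fits" are all assumptions for illustration. It only shows that duplicate indices are merged within a batch, so the shading count depends on how the index buffer gets cut into batches.

```python
def count_shaded_batched(indices, batch_size=32, prim_size=3):
    """Toy model: count vertex-shader invocations when duplicate
    indices are only merged inside the batch currently being filled."""
    if batch_size < prim_size:
        raise ValueError("batch_size must hold at least one primitive")
    batch = set()      # unique indices in the batch being filled
    shaded = 0         # total vertex-shader invocations
    batches = 0        # number of batches dispatched
    usable = len(indices) - len(indices) % prim_size  # ignore a trailing partial primitive
    for start in range(0, usable, prim_size):
        prim = indices[start:start + prim_size]
        new = set(prim) - batch
        if len(batch) + len(new) > batch_size:
            # The primitive does not fit: dispatch the batch, start a new one.
            shaded += len(batch)
            batches += 1
            batch = set()
        batch.update(prim)
    if batch:
        # Dispatch whatever is left at the end of the index buffer.
        shaded += len(batch)
        batches += 1
    return shaded, batches


quad = [0, 1, 2, 2, 1, 3]  # two triangles sharing an edge
print(count_shaded_batched(quad, batch_size=32))  # (4, 1)
print(count_shaded_batched(quad, batch_size=3))   # (6, 2)
```

With a large batch the shared edge is shaded once. With a batch that holds only one triangle, vertices 1 and 2 are shaded twice.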
There are two different stages: input assembly (immediately followed by vertex shading) and primitive assembly, which comes later. The graphics pipeline is somewhat more specialized than generic compute kernels, and I doubt all stages are implemented as generic kernels. Slightly older hardware in particular, notably the designs with dedicated memory for shaded vertex output, needs special wiring.
Check the article series, it's all explained there. There isn't a perfect 1:1 relation between vertices and shader invocations: some vertices get re-shaded if their uses are too far apart in the index buffer.
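If you want to see the "too far apart" effect in numbers, a common way to reason about it is to simulate a small FIFO cache of recently shaded indices. The sketch below is only an approximation under stated assumptions: the function name `count_shaded_fifo`, the FIFO replacement policy, and the `cache_size` values are made up for illustration, and real hardware may behave differently.

```python
from collections import deque


def count_shaded_fifo(indices, cache_size=16):
    """Toy model: count vertex-shader invocations with a FIFO cache
    of the most recently shaded vertex indices."""
    cache = deque(maxlen=cache_size)  # the oldest entry is dropped when full
    shaded = 0
    for index in indices:
        if index not in cache:
            # Cache miss: the vertex has to be shaded (again).
            shaded += 1
            cache.append(index)
    return shaded


# Triangle (0, 1, 2) is drawn, two unrelated triangles follow,
# then (0, 1, 2) is referenced again.
far_reuse = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2]
print(count_shaded_fifo(far_reuse, cache_size=16))  # 9
print(count_shaded_fifo(far_reuse, cache_size=4))   # 12
```

With room for 16 entries, every unique vertex is shaded exactly once (9 invocations). With room for only 4, vertices 0, 1 and 2 have already been evicted by the time they are referenced again, so they are shaded a second time (12 invocations).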