This is perhaps more of a discussion question, but I thought Stack Overflow would be the right place to ask it. I am studying the concept of instruction pipelining. I have been taught that a pipeline's instruction throughput increases as the number of pipeline stages increases, but in some cases throughput might not change. Under what conditions does this happen? I am thinking stalling and branching could be the answer, but I wonder if I am missing something crucial.
The pipeline can be stalled by an instruction that is waiting for the result of an earlier one, or by cache misses. Pipelining by itself doesn't guarantee that the instructions in flight are independent of each other. Here is a great presentation about the intricacies of the Intel/AMD x86 architecture: http://www.infoq.com/presentations/click-crash-course-modern-hardware
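To see the dependency effect concretely, here is a minimal C sketch (the loop count and constants are arbitrary illustration values, and a compiler at high optimization levels may transform the loops, so treat this as a demonstration rather than a benchmark). The first loop forms a serial dependency chain, so each add must wait for the previous one to leave the pipeline; the second splits the work across independent accumulators that the pipeline can overlap:

```c
#include <stdio.h>

#define N 100000000L

int main(void) {
    /* Dependent chain: every add reads the result of the previous add,
       so the pipeline cannot overlap them. */
    double dep = 0.0;
    for (long i = 0; i < N; i++) {
        dep += 1e-9;               /* serial dependency on dep */
    }

    /* Independent chains: four separate accumulators let the CPU keep
       several adds in flight in the pipeline at the same time. */
    double a = 0.0, b = 0.0, c = 0.0, d = 0.0;
    for (long i = 0; i < N; i += 4) {
        a += 1e-9;
        b += 1e-9;
        c += 1e-9;
        d += 1e-9;
    }

    printf("%f %f\n", dep, a + b + c + d);
    return 0;
}
```

Both loops do the same number of additions, but on a typical out-of-order x86 core the second one runs noticeably faster because the adds are independent and can fill the pipeline stages back to back.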
It explains stuff like this in great detail, and covers techniques for further improving throughput and hiding latency. JustJeff mentioned out-of-order execution for one; you also have shadow registers that are not exposed by the programmer-visible model (via register renaming, far more than the 8 architectural registers on x86), and you have branch prediction.
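Branch prediction is easy to demonstrate with the classic sorted-versus-unsorted experiment. This is a small C sketch (array size and threshold are arbitrary); the branch in `sum_above` takes the same path almost every time on sorted data, so the predictor nails it, while on random data it mispredicts often and the pipeline has to be flushed:

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Sum the elements >= 128; the branch outcome depends on the data. */
static long sum_above(const int *v, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (v[i] >= 128)   /* predictable on sorted data, random otherwise */
            sum += v[i];
    }
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    if (!v) return 1;
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;

    long unsorted = sum_above(v, N);   /* branch mispredicts frequently */
    qsort(v, N, sizeof *v, cmp_int);
    long sorted = sum_above(v, N);     /* branch is almost always predicted */

    printf("%ld %ld\n", unsorted, sorted);
    free(v);
    return 0;
}
```

Time the two calls (or run them under a profiler that reports branch misses) and the sorted pass typically comes out much faster, even though it does the same work, because each mispredicted branch costs a pipeline flush.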