Tags: performance, caching, architecture, pipeline

How could cache improve the performance of a pipeline processor?


I understand that accessing the cache is much faster than accessing main memory, and I have a basic idea of the miss rate and miss penalty concepts.

But a question just came to my mind: how could a cache be useful in a pipelined processor?

From my understanding, the clock cycle time is lower bounded by the longest time taken among all the stages. For example, if accessing the cache takes 1 ns and accessing main memory takes 10 ns, then the clock cycle time should be at least 10 ns; otherwise that stage could not complete in time. But then, even when a cache access finishes early, the instruction still has to wait until the next clock cycle.

I was imagining a basic five-stage pipeline: instruction fetch, decode, execute, memory access, and write back.

Am I completely misunderstanding something? Or maybe in reality we have a much more complex pipeline, where memory access is broken down into several steps, like a cache check followed by a main memory access, so that on a hit we can somehow skip the next cycle? But that would cause a problem too if the previous instruction didn't skip a cycle while the current one does...

I've been scratching my head over this... Any explanation would be highly appreciated!


Solution

  • The cycle time is not lower bounded by the longest time taken among all processes. In fact, a RAM access can take hundreds of cycles: when a memory access misses the cache, the pipeline simply stalls the dependent instructions (or, on an out-of-order processor, works on other instructions) until the data arrives. Processor architectures differ, but typical latencies might be:

    1 cycle to access a register
    4 cycles to access L1 cache
    10 cycles to access L2 cache
    75 cycles to access L3 cache
    hundreds of cycles to access main memory
    

    In the extreme case, if a computation is memory-intensive and constantly misses the cache, the CPU will be severely under-utilized: it requests data from memory and then waits until the data is available. On the other hand, if an algorithm repeatedly accesses the same region of memory, and that region fits entirely in the L1 cache (like inverting a matrix that is not too big), the CPU will be utilized much better: the algorithm starts by pulling the data into the cache, and the rest of the work reads and writes cached values. The two sketches below illustrate both points.
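
To put numbers on the "miss rate and miss penalty stuff" from the question, the standard back-of-the-envelope formula is the average memory access time, AMAT = hit time + miss rate × miss penalty. Here is a minimal sketch in C that plugs in the 4-cycle L1 figure from the list above; the 5% miss rate and the 200-cycle miss penalty are assumed values, not numbers from this answer:

    #include <stdio.h>

    int main(void)
    {
        double l1_hit_time  = 4.0;    /* cycles, from the latency list above */
        double miss_penalty = 200.0;  /* assumed cycles for a DRAM access    */
        double miss_rate    = 0.05;   /* assumed: 5% of accesses miss L1     */

        /* AMAT = hit time + miss rate * miss penalty */
        double amat = l1_hit_time + miss_rate * miss_penalty;
        printf("AMAT with cache:    %.1f cycles\n", amat);         /* 14.0  */
        printf("AMAT without cache: %.1f cycles\n", miss_penalty); /* 200.0 */
        return 0;
    }

Even with a modest miss rate, the average access costs about 14 cycles instead of 200, which is why the cache pays off even though an individual miss is expensive.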
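
The locality effect is also easy to demonstrate directly. The following sketch (mine, not part of the original answer) traverses the same matrix twice: row by row, which walks memory sequentially and hits the cache on almost every access, and column by column, which jumps a full row ahead on each step and misses constantly. The matrix size is an arbitrary choice; on typical hardware the row-major pass is several times faster:

    #include <stdio.h>
    #include <time.h>

    #define N 2048

    static double matrix[N][N];   /* 32 MB: far larger than L1/L2 cache */

    /* Time one full traversal; by_rows selects the access pattern. */
    static double traverse(int by_rows)
    {
        volatile double sum = 0.0;    /* volatile: keep the loads alive */
        clock_t start = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += by_rows ? matrix[i][j]   /* sequential, cache-friendly */
                               : matrix[j][i];  /* strided, cache-hostile     */
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        printf("row-major:    %.3f s\n", traverse(1));
        printf("column-major: %.3f s\n", traverse(0));
        return 0;
    }

Both loops execute exactly the same number of additions; the only difference between them is how well each access pattern uses the cache.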