Instruction less than one clock cycle

I sometimes read that there are instructions which take less than a clock cycle - how is this possible? Or is this the value when pipelining and out-of-order comes in the game?

Solution

from http://en.wikibooks.org/wiki/Microprocessor_Design/Performance_Metrics

(...)Historically, all early computers used many clock cycles during the execution of even the simplest instruction. During the RISC revolution, many designers focused on reducing this factor closer to the apparent minimum of 1 cycle per instruction. We will discuss some of the techniques used later in this book. Since then, CPUs that use techniques such as superscalar execution and multicore computing have reduced this even further. Such CPUs can (on average) use less than 1 cycle per instruction.

"CPI" is a throughput measure of how many instructions are completed (on average) for a given number of clocks. A CPU that can complete, on average, 2 instructions per cycle (a CPI of 0.5) may have a 20 stage pipeline, which inevitably causes a 20 cycle latency between an instruction fetch to the completion of that instruction. We ignore those 20 cycles when we calculate CPI.(...)