Tags: assembly, cpu, cpu-architecture, pipelining

How to ensure that one instruction is finished before a second instruction is started on a pipelined CPU?


Supposing there are two sequential instructions, like this:

instruction A
instruction B

Because of CPU pipelining, B will start before A finishes.

Does there exist a mechanism to ensure B starts after A finishes?

UPDATE:
I'm sorry that I did not describe the problem accurately. What I mean is that these two instructions have an application-level ordering dependency but no hazard. For example, in a transaction system, the first instruction flushes logs to persistent storage and the second notifies clients that the transaction has committed. So we can't execute the second instruction until the first has finished. How can this execution order be enforced?


Solution

  • Because of CPU pipelining, B will start before A finishes.

    So? Why is this a problem?

    In a basic pipelined architecture, instruction A will start executing on the first cycle, and then instruction B will start executing on the next cycle.

    Taking the basic 5-stage RISC pipeline as an example, it would look like this:

    Clock Cycle   |    1    |    2     |     3     |      4      |      5      |      6      |
    --------------|---------------------------------------------------------------------------
    Instruction A |  Fetch  |  Decode  |  Execute  | Mem. Access |  Writeback  |
    Instruction B |         |  Fetch   |  Decode   |   Execute   | Mem. Access |  Writeback  |
    

    The processor would begin fetching instruction A on the first clock cycle. On the second clock cycle, it would begin decoding instruction A while simultaneously fetching instruction B, and so on down the pipeline.

    The reason this works so well is that the instruction fetching unit is a completely separate piece of hardware from the instruction decoding unit (even though both may be implemented on the same slab of silicon), so it makes sense to keep each of these units occupied simultaneously. This is one mechanism for achieving instruction-level parallelism (ILP).

    Ultimately, you can see that instruction A will complete on cycle 5, while instruction B won't complete until cycle 6. (In general, N such instructions complete in 5 + (N - 1) cycles on this pipeline, rather than 5N cycles.) Still, that's way better than instruction A completing on cycle 5 and instruction B not being able to start until cycle 6, deferring its completion until cycle 10.

    Logic internal to the processor handles dependencies between instructions, so if instruction B somehow relies on the result of instruction A, the processor's decoder will detect that and will stall the execution of instruction B until its data is available (i.e., until instruction A gets far enough along in the pipeline that its result is ready). This is all handled seamlessly for you, but it does introduce performance costs (pipeline bubbles), so you want to avoid it whenever possible. That means writing your code so that dependent instructions are spread as far apart as possible, with independent instructions interspersed between them, as in the sketch below.
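
    For illustration, here is a minimal sketch in x86-64 assembly (NASM-style; the registers and surrounding code are invented for the example). On a simple in-order pipeline like the five-stage one above, a load followed immediately by an instruction that uses the loaded value forces a stall, while moving an independent instruction between them hides that latency. Modern out-of-order x86 cores do much of this reordering in hardware, so treat this purely as an illustration of the principle:

    ; Dependent pair: the add needs the value produced by the load,
    ; so a simple in-order pipeline must stall until the load's
    ; memory-access stage has completed (a load-use bubble).
    mov  rax, [rdi]      ; load a value from memory
    add  rax, 1          ; uses rax immediately -> stall

    ; Same work with an independent instruction moved in between:
    ; the pipeline does useful work while the load completes.
    mov  rax, [rdi]      ; load a value from memory
    add  rcx, rdx        ; independent of the load, fills the bubble
    add  rax, 1          ; by now the loaded value is available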

    Does there exist a mechanism to ensure B starts after A finishes?

    Yes, such mechanisms generally exist, but you don't usually want to use them because they slow execution down by defeating the entire advantage of a pipeline.

    These mechanisms are referred to as serializing instructions (or sometimes "barriers") because they erect a barrier that causes execution to be serialized at a particular point.

    For example, on the x86 architecture, the CPUID instruction is a serializing instruction (actually, one of several). So you could do:

    Instruction A
    CPUID
    Instruction B
    

    and this would ensure that instruction B will not start until after instruction A finishes executing.

    From the Intel architecture manuals:

    CPUID can be executed at any privilege level to serialize instruction execution. Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.

    See also: "Serializing Instructions" in Chapter 7 of the IA-32 Intel Architecture Software Developer's Manual, Volume 3, and AP-485, Intel Processor Identification and the CPUID Instruction.

    Technically, this doesn't guarantee that instruction B won't start down the pipeline at all. The processor might, for example, fetch and decode instruction B before it finishes executing instruction A. However, from the programmer's perspective (i.e., in terms of observable behavior), it will be as if instruction B only starts after instruction A has finished. A sketch of how this is typically used in practice follows.
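
    As a concrete (hypothetical) illustration of how a serializing instruction is typically used, here is a sketch of the classic CPUID-before-RDTSC timing idiom in x86-64 assembly (NASM-style; the "instruction A / instruction B" lines are placeholders). CPUID is used purely for its serializing effect here; note that it overwrites EAX, EBX, ECX, and EDX, so any live values in those registers would need to be saved first:

    ; ... instruction A: the work that must be finished first ...

    xor   eax, eax       ; select CPUID leaf 0
    cpuid                ; serialize: everything above must complete
                         ; before anything below is fetched and executed
    rdtsc                ; read the time-stamp counter into EDX:EAX

    ; ... instruction B: logically runs only after A has finished ...

    The same pattern answers the original question: placing a serializing instruction between instruction A and instruction B enforces the ordering, at the cost of draining the pipeline each time.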