multithreading pipeline cpu-architecture granularity

Why do the pipeline constraints of coarse-grained and fine-grained multithreading differ?


In "Computer Organization and Design: The Hardware/Software Interface, Sixth Edition" (RISC-V Edition) by David A. Patterson and John L. Hennessy, section 6.4 says this about "coarse-grained multithreading":

This change relieves the need to have thread switching be extremely fast and is much less likely to slow down the execution of an individual thread, since instructions from other threads will only be issued when a thread encounters a costly stall.

Because a processor with coarse-grained multithreading issues instructions from a single thread, when a stall occurs, the pipeline must be emptied or frozen. The new thread that begins executing after the stall must fill the pipeline before instructions are able to complete.

But for "fine-grained multithreading", the book doesn't mention any changes to the pipeline when switching threads:

This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that clock cycle.

Q: Since the book says:

A thread includes the program counter, the register state, and the stack.

and both kinds of multithreading switch threads when they encounter stalls, why must coarse-grained multithreading empty the pipeline and then refill it (because the pipeline only holds instructions from a single thread), while fine-grained multithreading does not?


Solution

  • I think the point is that if you're going to have two sets of register state, page tables, FP exception state, etc. that can be active at once, you might as well do fine-grained multithreading.

    So it wouldn't be a good tradeoff to make a coarse-grained multithreading CPU that paid most of the cost to support fine-grained multithreading. In this paragraph at least, that looks like an unstated assumption, but perhaps they discuss it elsewhere.

    The benefit of doing only coarse-grained multithreading this way is that you don't need to support having instructions from different contexts in the pipeline at once. That simplifies things: state such as FP exception flags and the rounding mode doesn't need to be tracked per instruction.

    Architectural state for the thread being swapped out can be saved to special storage that's only accessed by the hardware context-switching logic, instead of needing extra tag bits in many pipeline structures and a register alias table (RAT) with twice as many entries.

    (As Dr. Bandwidth comments, fine-grained multithreading is usually only used in CPUs with out-of-order execution and register renaming.)
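
To make the difference concrete, here is a minimal toy sketch in Python, not a model of any real CPU. It assumes a hypothetical 5-stage pipeline that issues one instruction per cycle, and every name in it (`DEPTH`, `fine_grained`, `coarse_grained`, and so on) is invented for illustration. Each pipeline stage stores only the thread ID of the instruction it holds. As a simplification, the coarse-grained version simply discards in-flight instructions at a switch; real hardware could freeze them instead, and would have to re-execute anything it discarded.

```python
from collections import deque

DEPTH = 5  # hypothetical number of pipeline stages


def empty_pipe():
    """Return a pipeline whose stages all hold bubbles (None)."""
    return deque([None] * DEPTH, maxlen=DEPTH)


def fine_grained(num_threads, cycles):
    """Issue round-robin from num_threads threads, one instruction per cycle.

    Each stage records the thread ID of its instruction, which stands in for
    the per-instruction tag that fine-grained hardware has to carry.
    """
    pipe = empty_pipe()
    history = []
    for cycle in range(cycles):
        # appendleft on a full bounded deque drops the oldest entry (last stage).
        pipe.appendleft(cycle % num_threads)
        history.append(tuple(pipe))
    return history


def coarse_grained(switch_cycles, cycles):
    """Issue from one active thread; flush and switch at each cycle in switch_cycles."""
    pipe = empty_pipe()
    active = 0
    history = []
    for cycle in range(cycles):
        if cycle in switch_cycles:
            pipe = empty_pipe()  # costly stall: discard in-flight instructions
            active += 1          # switch to the next hardware context
        pipe.appendleft(active)
        history.append(tuple(pipe))
    return history


def max_threads_in_flight(history):
    """Largest number of distinct threads present in the pipeline in any cycle."""
    return max(len({t for t in snap if t is not None}) for snap in history)


def completed(history):
    """Count cycles in which the last stage holds a real instruction."""
    return sum(1 for snap in history if snap[-1] is not None)


if __name__ == "__main__":
    fine = fine_grained(num_threads=2, cycles=12)
    coarse = coarse_grained(switch_cycles={6}, cycles=12)
    print("fine:  ", max_threads_in_flight(fine), "threads in flight,",
          completed(fine), "completed")
    print("coarse:", max_threads_in_flight(coarse), "threads in flight,",
          completed(coarse), "completed")
```

In this toy model the coarse-grained pipeline never holds more than one thread at a time, so a single "current context" would be enough, but each switch costs `DEPTH - 1` cycles of refill before anything completes again. The fine-grained pipeline avoids that refill, at the price of mixing threads in flight, which is why every stage needs to know which context its instruction belongs to.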