Tags: assembly · x86 · cpu-registers · mov

Time taken to load value from registers in x86


I am currently learning x86 from the book Art of Intel x86 Assembly. In the portion where it dwells on how different instruction classes and opcodes work, it says that one-byte opcodes are encoded as iiirrmmm, where the three i bits denote the instruction class, the two r bits select one of the 4 main registers, and the three m bits select the operand mode, optionally followed by a 2-byte memory operand, in forms such as AX, [AX], [AX+XXXX], etc. For example, 101 corresponds to [XXXX+BX] and 100 corresponds to [BX]. The book also mentions at first that accessing a value in a register takes zero clock cycles, since the registers are implemented on the chip.
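The bit-field split described above can be sketched in Python. This follows the book's simplified iiirrmmm scheme as quoted in the question, not the real x86 ModR/M encoding, and the example byte is made up for illustration:

```python
def decode_simple_opcode(byte):
    """Split a one-byte opcode into the iii/rr/mmm fields the book describes.

    This models the simplified 8086-style encoding from the question,
    not the actual x86 ModR/M scheme.
    """
    iii = (byte >> 5) & 0b111  # bits 7-5: instruction class
    rr = (byte >> 3) & 0b11    # bits 4-3: one of the four main registers
    mmm = byte & 0b111         # bits 2-0: operand mode, e.g. 100 -> [BX]
    return iii, rr, mmm

# 0b101_01_100 -> class 5, register 1, mode 4
print(decode_simple_opcode(0b10101100))  # (5, 1, 4)
```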

However, while explaining the time taken by an instruction to work completely, in order to figure out the time taken by the CPU to compute the address of the memory operand, it says this:

[Screenshot from the book: a table of the clock cycles it says are needed to compute each memory operand's address.]

Won't it be 1 and 0 cycles respectively? The book clearly states at the beginning that accessing a value in a register takes zero clock cycles, so how can it now say that 1 cycle is taken to access the value in BX?


Solution

  • It is very, very important that you understand that anything this book tells you about code execution speed is complete nonsense. The book is very old; 15 years is a lot of dog lives in processor development. Even what is visible in your screenshot was already no longer true back then, and it is dangerously untrue today.

    Next, the CPU computes the address of the memory operand

    No, not really. Operand address calculation was the job of the AGU, the "Address Generation Unit": a separate circuit on the processor core that ran independently of the main execution core. That is why the extra work can take 0 CPU cycles: the work is done concurrently. And it did not stop at the AGU; modern processors have many execution units that can complete jobs at the same time.

    We will assume that clock cycles and memory cycles are equivalent

    Not true back then, horribly untrue today. The memory bus is hundreds of times slower than the processor core. It is a problem related to distance: the further an electrical signal has to travel, the harder it is to deliver it to its destination without it getting corrupted, and only going slower can solve that. Modern processors with gigahertz clock rates therefore invest heavily in caches, extra memory that stores a copy of the data in RAM. The L1 cache is very important; it typically stores 32 kilobytes of data and 32 kilobytes of instructions and sits closest to the processor core, yet still takes about 3 CPU cycles to read. L2 and L3 are bigger and, inevitably, sit further away and thus take more cycles. Any program that suffers an execution stall because it takes ~150 CPU cycles to read the data from RAM will be a very poorly performing program, of course, regardless of what instructions it uses.
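To see why "clock cycles and memory cycles are equivalent" breaks down, a back-of-the-envelope average memory access time (AMAT) calculation helps. Only the 3-cycle L1 and ~150-cycle RAM figures come from the text above; the hit rates and the 12-cycle L2 latency are invented purely for illustration:

```python
def amat(l1_hit, l1_lat, l2_hit, l2_lat, ram_lat):
    """Average memory access time, in CPU cycles, for a two-level cache.

    l1_hit / l2_hit are hit rates (fractions); the latencies are in
    cycles. All numbers here are illustrative, not measurements of any
    specific CPU.
    """
    miss_cost = l2_hit * l2_lat + (1 - l2_hit) * ram_lat
    return l1_hit * l1_lat + (1 - l1_hit) * miss_cost

# 95% of accesses hit L1 at 3 cycles; 90% of the misses hit L2 at
# 12 cycles; the rest go all the way to RAM at 150 cycles:
print(round(amat(0.95, 3, 0.90, 12, 150), 2))  # 4.14
```

Even with optimistic hit rates, the average access is several cycles, not one, and a cache-hostile access pattern drives it toward the RAM figure.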

    This is not where the discomfort stops; the entire premise of the book is very misleading today. Modern processors don't actually execute x86 instructions. They have the equivalent of a just-in-time compiler, the kind that is used in Java or .NET. They translate x86 instructions into "micro-ops": CISC instructions getting translated into RISC instructions, the kind that are easy to execute out-of-order and concurrently across multiple execution sub-units. Exactly what that looks like is a very well-kept secret; companies like Intel and AMD treat it as intellectual property that nobody should know anything about. And, most of all, that nobody should ever take a dependency on, since that would make it difficult for them to improve their processor designs.

    A clear casualty of this innovation is that talking about an instruction taking a certain number of CPU cycles is no longer meaningful. I already pointed you to Agner Fog's manuals. They talk about latency, the amount of time it takes to decode an instruction and get a result, and throughput, determined by the number of identical instructions that can be executed at the same time. Those numbers only give you a hint of how hard a processor needs to work; they are completely useless for predicting the actual execution time of a program. Add the state of the caches, the speed of the memory bus, the ability of the prefetcher to guess which memory location needs to be retrieved ahead of time, and the amount of luck the branch predictor has in guessing the code flow, all acting as strong randomizers. Only a profiler can tell you how long it took.
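A crude sketch, in Agner Fog's terms, of why latency and throughput are only hints: a chain of dependent instructions is bounded by latency, while independent instructions are bounded by issue throughput. The 3-cycle latency and 1-per-cycle throughput below are illustrative, not figures for any real instruction:

```python
def estimated_cycles(n_ops, latency, throughput, dependent):
    """Back-of-the-envelope cycle estimate for a stream of identical ops.

    latency: cycles before a result can feed the next instruction.
    throughput: instructions the core can start per cycle.
    All numbers are illustrative; real cores add caches, prefetching and
    branch prediction on top, so only a profiler gives real answers.
    """
    if dependent:
        # Each op must wait for the previous result: latency-bound.
        return n_ops * latency
    # Independent ops are limited by the issue rate: throughput-bound.
    return n_ops / throughput

# 100 multiplies, 3-cycle latency, 1 started per cycle:
print(estimated_cycles(100, 3, 1, dependent=True))   # 300
print(estimated_cycles(100, 3, 1, dependent=False))  # 100.0
```

The same 100 instructions land anywhere between 100 and 300 cycles in this toy model depending only on data dependencies, before any memory effects are considered.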