I'm studying processors, and one thing that caught my attention is that high-performance CPUs can execute more than one instruction per clock cycle, and even execute them out of order, to improve performance. All of this without any help from the compiler.
As far as I understand, the processor is able to do that by analysing data dependencies to determine which instructions can be run first, or in the same ILP-parallel step (issue).
Edit:
I'll try giving an example. Imagine these two pieces of code:
int myResult;
myResult = myFunc1(); // 1
myResult = myFunc2(); // 2
j = myResult + 3; // 3
-
int myFirstResult, mySecondResult;
myFirstResult = myFunc1(); // 1
mySecondResult = myFunc2(); // 2
j = mySecondResult + 3; // 3
They both do the same thing; the difference is that in the first I reuse my variable, and in the second I don't.
I assume (and please correct me if I'm wrong) that the processor could run instructions 2 and 3 before instruction 1 in the second example, because the data would be stored in two different places (registers?).
The same would not be possible in the first example: if it ran instructions 2 and 3 before instruction 1, the value assigned in instruction 1 would end up being the one kept in memory (instead of the value from instruction 2).
My question is:
Is there any strategy to run instructions 2 and 3 before 1 if I reuse the variable (as in the first example)?
Or does reusing variables prevent instruction-level parallelism and OoO execution?
A modern microprocessor is an extremely sophisticated piece of equipment, complex enough that understanding every single aspect of how it functions is beyond the reach of most people. Your compiler or runtime adds yet another layer on top of that. It's only really possible to speak in generalities here, as ARM processor X might handle this differently than ARM processor Y, and both of those differently from Intel U or AMD V.
Looking more closely at your code:
int myResult;
myResult = myFunc1(); // 1
myResult = myFunc2(); // 2
j = myResult + 3; // 3
The int myResult line doesn't necessarily do anything CPU-wise. It just instructs the compiler that there will be a variable named myResult of type int. The variable isn't initialized, so there's no need to do anything yet.
The value from the first assignment is never used. By default the compiler usually does a fairly straightforward conversion of your code to machine instructions, but when you turn on optimization, which you normally do for production code, that assumption goes out the window. A good compiler will recognize that this value is never used and omit the assignment. A better compiler will warn you that the value is never used.
The second assignment actually stores a value, and that value is later used. Obviously, before the third statement can execute, the second assignment must be complete. There's not much optimizing that can go on here unless those functions are trivial and end up inlined; then it's a matter of what those functions do.
A "superscalar" processor, or one capable of running things out of order, has limitations on how ambitious it can get. The type of code it works best with resembles the following:
int a = 1;
int b = f();
int c = a * 2;
int d = a + 2;
int e = g(b);
The assignment of a is straightforward and immediate. b is a computed value. Where it gets interesting is that c and d share the same dependency (a) and can actually execute in parallel. They also don't depend on b, so theoretically they could run before, during, or after the f() call, so long as the end state is correct.
A single thread can execute multiple operations concurrently, but most processors have limits on the types and number of them. For example, a floating-point multiply and an integer add could happen together, or two integer adds, but not two floating-point multiplies. It depends on what execution units the CPU has, what registers they can operate on, and how the compiler has arranged the data in advance.
If you're looking to optimize code and shave nanoseconds off of things you'll need to find a really good technical manual on the CPU(s) you're targeting, plus spend untold hours trying different approaches and benchmarking things.
The short answer is variables don't matter. It's all about dependencies, your compiler, and what capabilities your CPU has.