Tags: performance, rust, optimization, expression

Is there a runtime cost of assigning variables in Rust?


Is there a runtime cost to assigning intermediate values to variables, compared to just writing the computations inline in a single expression?

For example, is temporary_assignments slower or otherwise inferior to inlined?

// Use variable assignment to assign names to the intermediate components of the 
// final answer, then combine those names in the final answer.
fn temporary_assignments(a: f32, b: f32, c: f32) -> f32 {
    let fourac = 4. * a * c;
    let discrim = b * b - fourac;
    let rad = discrim.sqrt();
    let denom = 2. * a;
    (-b + rad) / denom
}

// Express the entire final answer with literals.
fn inlined(a: f32, b: f32, c: f32) -> f32 {
    (-b + (b * b - 4. * a * c).sqrt()) / (2. * a)
}

In what situations does "inlining" expressions in this manner offer a runtime performance improvement compared to using variables to store intermediate values?


Solution

  • TLDR:

    No, intermediate variables do not have a runtime cost in Rust.

    Slightly longer TLDR:

    While different ways of writing a (semantically equivalent) function might lead to slightly different CPU instructions and performance, there is generally no correlation (neither positive nor negative) between the number of intermediate variables and the quality of the assembly generated by a modern optimizing compiler like rustc. In many cases, the output will not differ at all.

    If you look at the optimized assembly output for both your examples, you will find that it is nearly identical, and should perform about the same too.
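    If you want to convince yourself that the two versions are also behaviorally identical, here is a small sanity-check sketch (reusing the two functions from the question; the bit-for-bit match relies on rustc performing the same float operations in the same order, which it does by default since it applies no fast-math-style reassociation):

    fn temporary_assignments(a: f32, b: f32, c: f32) -> f32 {
        let fourac = 4. * a * c;
        let discrim = b * b - fourac;
        let rad = discrim.sqrt();
        let denom = 2. * a;
        (-b + rad) / denom
    }

    fn inlined(a: f32, b: f32, c: f32) -> f32 {
        (-b + (b * b - 4. * a * c).sqrt()) / (2. * a)
    }

    fn main() {
        // x^2 + 5x + 2 has a positive discriminant, so sqrt() stays real.
        let (a, b, c) = (1.0_f32, 5.0, 2.0);
        // Both versions perform the same operations in the same order, so
        // the results should match bit for bit.
        assert_eq!(
            temporary_assignments(a, b, c).to_bits(),
            inlined(a, b, c).to_bits()
        );
    }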

    Background:

    Just like C/C++, Rust is a statically compiled language with an optimizing compiler backend (LLVM in the case of rustc).

    Your CPU does not have a concept of 'variables'; it has registers and memory. So the compiler analyzes the source code that you wrote and tries to figure out the most efficient way to express those semantics using your CPU's instruction set.

    The Rust compiler with its LLVM backend does this by:

    1. Transforming your source code into an intermediate representation (IR) that is closer to actual machine instructions: LLVM-IR
    2. Transforming this IR using many algorithms and heuristics (called optimization passes), trying to make it more efficient.
    3. Converting the optimized LLVM-IR into CPU instructions for your actual target architecture (and then optimizing some more, which is less relevant here).
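
    If you want to look at the IR from step 1 yourself, here is a minimal sketch (the flags are standard rustc options, but the exact IR you get will vary with compiler version and optimization level):

    // example.rs -- compile with `rustc --emit=llvm-ir example.rs`, which
    // writes the IR to example.ll (add -O to see the optimized version).
    // #[no_mangle] keeps the function name readable instead of mangled.
    #[no_mangle]
    pub fn inlined(a: f32, b: f32, c: f32) -> f32 {
        (-b + (b * b - 4. * a * c).sqrt()) / (2. * a)
    }

    fn main() {}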

    What compiler optimizations can and cannot do:

    For example, here is the (non-optimized) LLVM-IR for your inlined function:

    define float @inlined(float %0, float %1, float %2) unnamed_addr {
      %4 = fneg float %1
      %5 = fmul float %1, %1
      %6 = fmul float 4.000000e+00, %0
      %7 = fmul float %6, %2
      %8 = fsub float %5, %7
      %9 = call float @"<sqrt_function>"(float %8)
      %10 = fadd float %4, %9
      %11 = fmul float 2.000000e+00, %0
      %12 = fdiv float %10, %11
      ret float %12
    }
    

    As you can see, this representation uses virtual registers (e.g. %7) for the result of literally every single operation. These virtual registers will later be turned into the actual physical registers of your CPU architecture during register allocation.

    So your source-code variables are already eliminated before serious optimization work has even begun.

    Now, this IR will run through many optimization passes. Some interesting ones are:

    • mem2reg: Instead of storing data in memory and retrieving it with load and store instructions, just keep the data in a register and operate on it directly. This can be significantly more efficient, but it is not always possible; we might have passed around references that rely on the data having a memory address. If you were copying around any structs in your example, this optimization pass (in combination with others) would help get rid of many unnecessary copies.
    • cse: Common subexpression elimination: Reuse results we already computed somewhere else, instead of recomputing them.
    • inlining: Instead of calling a function, copy-paste the IR of that function into our function. This gives the optimizer much more context about what is going on (which majorly helps other optimization passes), but it can of course increase the binary size, since we duplicate code, so we can't always do this. A good inlining strategy is often considered the most critical part of an optimizing compiler. This is also why virtual function calls (or dyn trait calls in Rust) are often considered expensive: the compiler usually can't inline those (see the sketch after this list).
    • Many more, like loop unrolling, scalar promotion, ...
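
    To make the point about dynamic dispatch concrete, here is an illustrative sketch (the trait and type names are made up for this example) contrasting static dispatch, which the compiler can inline, with dyn dispatch, which it usually cannot:

    trait Op {
        fn apply(&self, x: f32) -> f32;
    }

    struct Double;

    impl Op for Double {
        fn apply(&self, x: f32) -> f32 {
            2. * x
        }
    }

    // Static dispatch: monomorphized for each concrete T, so the optimizer
    // sees the body of `apply` and can inline the call away entirely.
    fn run_static<T: Op>(op: &T, x: f32) -> f32 {
        op.apply(x)
    }

    // Dynamic dispatch: the callee is only found at runtime via the vtable,
    // so the optimizer usually cannot inline it.
    fn run_dyn(op: &dyn Op, x: f32) -> f32 {
        op.apply(x)
    }

    fn main() {
        let d = Double;
        assert_eq!(run_static(&d, 3.), run_dyn(&d, 3.));
    }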

    While lots of work goes into making backends like LLVM emit good assembly, these tools aren't magic. If you do something that the compiler doesn't 'understand' (i.e., has no optimization pass for), it won't help you. So if performance is critical, use tools like the fantastic godbolt.org to figure out what is going on in your concrete case.
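
    If you also want rough timings rather than just assembly, here is a micro-benchmark sketch (for serious measurements, a dedicated harness like criterion is the better tool); std::hint::black_box hides values from the optimizer so the work being measured cannot be constant-folded or removed as dead code:

    use std::hint::black_box;
    use std::time::Instant;

    fn inlined(a: f32, b: f32, c: f32) -> f32 {
        (-b + (b * b - 4. * a * c).sqrt()) / (2. * a)
    }

    fn main() {
        let start = Instant::now();
        let mut acc = 0.0_f32;
        for i in 0..10_000_000_u32 {
            // c = -2.0 keeps the discriminant positive for every i, so we
            // never take the square root of a negative number.
            acc += inlined(black_box(1.0), black_box(i as f32), black_box(-2.0));
        }
        // Printing `acc` keeps the accumulated result live.
        println!("sum = {acc}, elapsed = {:?}", start.elapsed());
    }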

    Most importantly though, please don't do weird premature 'optimizations' like omitting intermediate variables that would improve readability for the sake of 'performance' without understanding what's actually going on :).

    Caveats:

    • The above assumes that we are talking about moving around small and simple values like integers, floats, or references. If you clone a complex structure like a Vec or a large array, where actual work has to be done, this is of course different; see the sketch after this list. (In some cases compilers can even optimize that away, though.)
    • Many optimization passes are disabled in Debug mode, so make sure to benchmark Release builds.
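
    As a tiny illustration of the first caveat (the numbers are made up): passing a reference moves only a pointer and a length, while clone() really does copy the data.

    fn sum(v: &[f32]) -> f32 {
        v.iter().sum()
    }

    fn main() {
        let v = vec![1.0_f32; 1_000_000];
        // Cheap: no data is copied, just a fat pointer is passed.
        let by_ref = sum(&v);
        // Actual work: clone() copies about 4 MB before the summing starts
        // (though in trivial cases the optimizer may elide even this).
        let cloned = v.clone();
        let by_clone = sum(&cloned);
        assert_eq!(by_ref, by_clone);
    }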