
Should I use i32 or i64 on a 64-bit machine?


main.rs

#![feature(core_intrinsics)]
fn print_type_of<T>(_: &T) {
    println!("{}", unsafe { std::intrinsics::type_name::<T>() });
}

fn main() {
    let x = 93;
    let y = 93.1;

    print_type_of(&x);
    print_type_of(&y);
}

If I compile with "rustc +nightly ./main.rs", I get this output:

$ ./main

i32
f64
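
For reference, the same check can be sketched on stable Rust with `std::any::type_name`, so no nightly feature is needed. The helper name `type_name_of` is only illustrative, and the standard library documents the returned strings as best-effort, not as a stable guarantee:

```rust
use std::any::type_name;

// Hypothetical helper: returns the compiler's name for the argument's type.
fn type_name_of<T>(_: &T) -> &'static str {
    type_name::<T>()
}

fn main() {
    let x = 93; // unconstrained integer literal
    let y = 93.1; // unconstrained float literal

    println!("{}", type_name_of(&x));
    println!("{}", type_name_of(&y));
}
```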

I'm running an x86_64 Linux machine. Floating-point variables are double precision by default, which is good. Why are integers only 4 bytes? Which should I use? If I don't need i64, should I use i32? Is i32 better for performance?


Solution

  • Is i32 better for performance?

    That's actually a subtle question. Recent instruction-level benchmarks, for example for SkylakeX, show for the most part no difference between 64-bit and 32-bit instructions. One exception is division: 64-bit division is slower than 32-bit division, even when dividing the same values (division is one of the few variable-time instructions whose latency depends on the values of its inputs).

    Using i64 for data also makes auto-vectorization less effective, since only half as many 64-bit elements fit in a vector register. This is also one of the rare places where data smaller than 32 bits has a use beyond data-size optimization. Of course the data size also matters for the i32 vs i64 question: working with sizable arrays of i64 can easily be slower just because they are bigger, costing more space in the caches and (if applicable) more memory bandwidth. So if the question is [i32] vs [i64], then it matters.
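
    To make the footprint difference concrete, here is a minimal sketch that measures the element storage of two vectors of equal length. The helper `payload_bytes` and the element count are illustrative assumptions, not part of the question:

```rust
use std::mem::size_of_val;

// Hypothetical helper: bytes occupied by the elements of a slice
// (the pointer/length/capacity header of a Vec is not counted).
fn payload_bytes<T>(data: &[T]) -> usize {
    size_of_val(data)
}

fn main() {
    let n = 1_000_000; // illustrative element count
    let small: Vec<i32> = vec![0; n];
    let large: Vec<i64> = vec![0; n];

    // The i64 vector needs twice the cache space and memory bandwidth.
    println!("Vec<i32>: {} bytes", payload_bytes(&small));
    println!("Vec<i64>: {} bytes", payload_bytes(&large));
}
```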

    Even more subtle is the fact that using 64-bit operations means that the code will contain more REX prefixes on average, making it slightly less dense, so less of it fits in the L1 instruction cache at once. For example, add eax, 1 encodes in 3 bytes, while add rax, 1 needs a REX.W prefix and takes 4. This is a small effect though. Just having some 64-bit variables in the code is not a problem.

    Despite all that, definitely don't overuse i32, especially in places where you should really have a usize. For example, do not do this:

    // don't do this
    for i in 0i32..data.len() as i32 {
        sum += data[i as usize];
    }

    This causes a large performance regression. Not only is there now a pointless sign-extension in the loop, it also defeats bounds check elimination and auto-vectorization. But of course there is no reason to write code like that in the first place: it's unnatural and harder than doing it right.
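
    For contrast, "doing it right" might look like the sketch below. The element type i64 and the function names are assumptions for illustration, since the snippet above does not show how data and sum are declared:

```rust
// Sum with an iterator: no index variable and no casts.
fn sum_iter(data: &[i64]) -> i64 {
    data.iter().sum()
}

// Sum with an explicit index of the natural index type, usize.
fn sum_indexed(data: &[i64]) -> i64 {
    let mut sum = 0;
    for i in 0..data.len() {
        sum += data[i];
    }
    sum
}

fn main() {
    let data = [1, 2, 3, 4];
    println!("{} {}", sum_iter(&data), sum_indexed(&data));
}
```

    Both versions keep the index (if any) as a usize, so the compiler can see that it never goes out of bounds.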


    On Alder Lake (P cores) and Raptor Lake (P cores) (and possibly newer CPUs), there are some quirky differences between 32-bit and 64-bit addition with an immediate operand. It's not as simple as the table makes it look: the reported latency is not a fair representation of the mechanism that causes it. The mechanism (see "immediate additions treated as NOPs") is not fully explained but involves front-end trickery (rewriting µops during the rename phase to take an offset, or something like that), not a 0.17-cycle adder (which, if it could exist at all, would also benefit additions between two registers, but those are still 1-cycle operations).

    However exactly the mechanism works, it means that 64-bit additions with an immediate are favoured over 32-bit additions with an immediate on some CPUs in some contexts, namely when the same register is added to (or subtracted from, for that matter) multiple times in a row with no intervening non-addition modifications of that register. That pattern is followed by many loop counters and array iterators (and not much else), many of which are 64-bit already.