Context: this is a function that needs to run roughly once per frame, and is therefore very performance-critical. The function contains a loop, with operations inside it.
private int MyFunction(int number)
{
    // Code
    for (int i = 0; i <= 10000; i++)
    {
        var value = i * number;
        var valuePow2 = value * value;
        // Some code which uses valuePow2 several times
    }
    return 0; // Not actual line
}
Now, because of mathematical properties, we know that (a * b)² is equal to a² * b².
So, it would be possible to make my function into this:
private int MyFunction(int number)
{
    // Code
    var numberPow2 = number * number;
    for (int i = 0; i <= 10000; i++)
    {
        var iPow2 = i * i;
        var valuePow2 = numberPow2 * iPow2;
        // Some code which uses valuePow2 several times
    }
    return 0; // Not actual line
}
Intuitively, this seems like it should be faster, since number² does not vary and is now calculated only once, outside of the loop. At the very least, this would be much faster for a human to do, because the x² operation inside the loop is done on a much smaller number.
What I am wondering is: in C#, when you use types like int, will the multiplication actually be faster with smaller numbers?
For example, will 5 * 5 execute faster than 5000 * 5000?
If so, then the second version is better, even if only by a small margin, because in it half of the multiplications are done on smaller numbers.
But if, for a given data type, the time is constant, then the first version of the function is better: I do the same amount of multiplication inside the loop either way, but the second version does one extra multiplication before the loop starts.
I am aware that, for all intents and purposes, the performance difference is negligible. The second version was suggested to me in a code review because the function is critical, and I can't find any documentation to support either view.
For example, will 5 * 5 execute faster than 5000 * 5000?
For compile-time constants, 5 * x is cheaper than 5000 * x, because the former can be done with lea eax, [rdi + rdi*4].
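As a rough illustration (hypothetical method names; the exact instructions depend on the JIT and target CPU):

static int TimesFive(int x) => 5 * x;            // can compile to a single lea (shift-and-add)
static int TimesFiveThousand(int x) => 5000 * x; // needs a real multiply, e.g. imul eax, ecx, 5000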
But for runtime variables, the only integer instruction with data-dependent performance is division. This applies on any mainstream CPU: pipelining is so important that even if some cases could run with lower latency, they typically don't because that makes scheduling harder. (You can't have the same execution unit produce 2 results in the same cycle; instead the CPU just wants to know that putting inputs in on one cycle will definitely result in the answer coming out 3 cycles later.)
(For FP, again only division and sqrt have data-dependent performance on normal CPUs.)
Code using integers or FP that has any data-dependent branching can be much slower if the branches go a different way. (e.g. branch prediction is "trained" on one sequence of jumps for a binary search; searching with another key will be slower because it will mispredict at least once.)
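A minimal C# sketch of that effect (names and data are mine, not from the question): the loop body is identical, but branch prediction makes it run faster on sorted input:

static long SumAboveThreshold(int[] data)
{
    long sum = 0;
    foreach (var x in data)
    {
        if (x >= 128) // data-dependent branch: predictable on sorted data, often mispredicted on shuffled data
            sum += x;
    }
    return sum;
}

Filling data with random values in [0, 256) and timing SumAboveThreshold before and after Array.Sort(data) shows the same work running measurably faster once the branch becomes predictable.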
And for the record, suggestions to use Math.Pow instead of integer * are insane. Simply converting an integer to double and back is slower than multiplying it by itself with an integer multiply.
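For comparison, a sketch of the two approaches:

int value = 1234;
int squared    = value * value;           // one integer multiply, stays in integer registers
int squaredPow = (int)Math.Pow(value, 2); // int -> double conversion, a pow call, then double -> int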
Adam's answer links a benchmark that's looping over a big array, with auto-vectorization possible. SSE / AVX2 only has 32-bit integer multiply, and 64-bit elements take more memory bandwidth; that's also why it shows speedups for 16 and 8-bit integers. So it finds 64-bit c=a*b running at half speed on a Haswell CPU, but that does not apply to your loop.
In scalar code, imul r64, r64 has identical performance to imul r32, r32 on Intel mainstream CPUs (since at least Nehalem) and on Ryzen (https://agner.org/optimize/): both are 1 uop, 3 cycle latency, 1/clock throughput.
It's only AMD Bulldozer-family, and AMD Atom and Silvermont, where 64-bit scalar multiply is slower. (Assuming 64-bit mode of course! In 32-bit mode, working with 64-bit integers is slower.)
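In C# terms, this means an int multiply and a long multiply should cost the same on those mainstream 64-bit CPUs, so there is no reason to shrink the type for speed. A sketch (actual codegen is up to the JIT):

int  a = 5000, b = 5000;
long c = 5000, d = 5000;
int  productInt  = a * b; // imul r32, r32
long productLong = c * d; // imul r64, r64: same cost on the CPUs above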
For a fixed value of number, instead of recalculating i*number every iteration, compilers can and will optimize this to value += number. This is called a strength-reduction optimization, because addition is a "weaker" (slightly cheaper) operation than multiplication.
for (...) {
    var value = i * number;
    var valuePow2 = value * value;
}
can be compiled into asm that does something like
var value = 0;
for (...) {
    var valuePow2 = value * value;
    ...
    value += number;
}
You might try writing it by hand that way, in case the compiler isn't doing it for you.
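Written out in C#, that hand-optimized version would look something like this (a sketch based on the loop above):

var value = 0;
for (int i = 0; i <= 10000; i++)
{
    var valuePow2 = value * value;
    // Some code which uses valuePow2 several times
    value += number;
}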
But integer multiplication is very cheap and fully pipelined on modern CPUs. It has slightly higher latency than add, and can run on fewer ports (usually only 1/clock throughput instead of 4/clock for add), but you say you're doing significant work with valuePow2. That should let out-of-order execution hide the latency.
If you check the asm and the compiler is using a separate loop counter incrementing by 1, you could also try to hand-hold your compiler into optimizing the loop to use value as the loop counter.
var maxval = number * 10000;
for (var value = 0; value <= maxval; value += number) {
    var valuePow2 = value * value;
    ...
}
Be careful if number*10000 can overflow and you need it to wrap correctly: in that case this loop would run far fewer iterations. (Unless number is so big that value += number also wraps...)
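If overflow is a real concern, one way to sidestep it (a sketch, assuming number > 0 and that the values involved fit in 64 bits) is to widen the bound to long:

long maxval = (long)number * 10000; // computed in 64 bits, so it can't wrap
for (long value = 0; value <= maxval; value += number)
{
    var valuePow2 = value * value; // note: this product can still overflow long for very large value
    // ...
}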