c#, algorithm, performance, factorial

Why do these factorial algorithms differ so much in performance?


I have a question about the following code:

using UnityEngine;

    private void Calc(int n) {
        float time = Time.realtimeSinceStartup;
        (int fact, int fact1, int fact2, int fact3) = (1, 1, 1, 1);
        for (int i = 1; i < n; i+=4) {
            fact *= i;
            fact1 *= i+1;
            fact2 *= i+2;
            fact3 *= i+3;
        }
        time = Time.realtimeSinceStartup - time;
        Debug.Log("calc: "+time);
    }

    private void Calc_2(int n) {
        float time = Time.realtimeSinceStartup;
        int fact = 1;
        for (int i = 2; i < n; i++) {
            fact *= i;
        }
        time = Time.realtimeSinceStartup - time;
        Debug.Log("calc_2: "+time);
    }

When n = 2000000000, the calculation times of the two methods are very different:

calc: 0.4483032
calc_2: 1.34198

Why is there such a big difference in performance between these two similar pieces of code?


Solution

  • I am not coding in your environment, so I might be wrong, but modern CPUs can execute several independent instructions at the "same" time (how many depends on how many prefetch stages there are inside the CPU core).

    If you unroll the naive iteration:

    for (fact = 1, i = 2; i < n; i++) fact *= i;
    

    you get:

    ...
    
    i++;
    if (i>=n) break;  // this is a branch, which can mess things up, but it is taken only once
    fact *= i;        // fact depends on the i from the i++ just before, so it has to wait for its result
    
    i++;
    if (i>=n) break;
    fact *= i;
    
    i++;
    if (i>=n) break;
    fact *= i;
    
    ...
    

    Now if you unroll your faster version:

    for (int i = 1; i < n; i += 4) {
        fact  *= i;
        fact1 *= i+1;
        fact2 *= i+2;
        fact3 *= i+3;
    }
    

    you should get (with a good compiler) something like this:

    ...
    
    i+=4;
    if (i>=n) break;  // this is a branch, which can mess things up, but it is taken only once
    i1 = i+1;         // these depend on the i+=4 above, so they must wait
    i2 = i+2;
    i3 = i+3;
    fact  *= i;       // these depend on i, i1, i2, i3 above, so they must wait
    fact1 *= i1;
    fact2 *= i2;
    fact3 *= i3;
    
    i+=4;
    if (i>=n) break; 
    i1 = i+1;
    i2 = i+2;
    i3 = i+3;
    fact  *= i;
    fact1 *= i1;
    fact2 *= i2;
    fact3 *= i3;
    
    i+=4;
    if (i>=n) break; 
    i1 = i+1;
    i2 = i+2;
    i3 = i+3;
    fact  *= i;
    fact1 *= i1;
    fact2 *= i2;
    fact3 *= i3;
    
    ...
    

    So there are still a lot of waits, but if you look closer, each wait is for an instruction that was issued (in most cases) 4 instructions earlier, so it has most likely already finished and there may be no wait at all. If your CPU core has at least 4 prefetch stages, it can effectively execute 4 iterations at once. If you enlarge the number of independent computations per iteration, the performance will stop improving once you have more "parallel" computations than your CPU has prefetch stages.
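
    Just to illustrate the point, here is what widening the unroll further might look like (Calc_8 is a hypothetical name, written in the same Unity style as the code above); on a typical CPU the gain over 4 accumulators is small or zero because the parallel multiply capacity is already saturated:

    // hypothetical 8-accumulator variant, only to illustrate the limit:
    // more independent chains help only until the CPU core runs out of
    // parallel execution resources
    private void Calc_8(int n) {
        float time = Time.realtimeSinceStartup;
        int f0 = 1, f1 = 1, f2 = 1, f3 = 1, f4 = 1, f5 = 1, f6 = 1, f7 = 1;
        for (int i = 1; i < n; i += 8) {
            // eight independent multiply chains, none of them waits on another
            f0 *= i;     f1 *= i + 1; f2 *= i + 2; f3 *= i + 3;
            f4 *= i + 4; f5 *= i + 5; f6 *= i + 6; f7 *= i + 7;
        }
        time = Time.realtimeSinceStartup - time;
        Debug.Log("calc_8: " + time);
    }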

    Also, the faster version has 4 times fewer branches (as it has 4 times fewer loop iterations), which also has a minor impact on speed.

    However, once you change your factorial to bigint (as 2000000000! will not fit into a 32 bit nor a 64 bit int), this will no longer matter, because even basic operations like ++, --, +, -, *, / will no longer be a single instruction but an entire routine with its own complexity ...
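
    For example, a minimal sketch (assuming System.Numerics.BigInteger) of the same loop with exact big integers; every *= is now an arbitrary-precision multiply whose cost grows with the size of the running product, so loop overhead and instruction scheduling stop being the bottleneck:

    using System.Numerics;

    // minimal sketch: exact factorial with BigInteger
    // each multiply works on a number with an ever growing digit count,
    // so its cost dwarfs the branches and the loop overhead
    static BigInteger Factorial(int n) {
        BigInteger fact = BigInteger.One;
        for (int i = 2; i <= n; i++) {
            fact *= i;   // no longer a single CPU instruction
        }
        return fact;
    }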

    For results that large you need to combine fast algorithms like these:

    or better...
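
    For example, one common direction is a divide and conquer product tree (just a rough sketch of the idea, again with System.Numerics.BigInteger): split the range in half so both operands of every multiply stay roughly the same size, which tends to be much faster than the plain loop once the numbers get really big:

    using System.Numerics;

    // sketch of a product-tree factorial: compute the product of [lo..hi]
    // by splitting the range in half so the operands of each multiply
    // stay balanced in size
    static BigInteger ProductRange(int lo, int hi) {
        if (lo > hi) return BigInteger.One;
        if (lo == hi) return lo;
        int mid = lo + (hi - lo) / 2;   // avoids int overflow for large lo+hi
        return ProductRange(lo, mid) * ProductRange(mid + 1, hi);
    }

    static BigInteger FactorialFast(int n) {
        return ProductRange(2, n);      // n! = 2 * 3 * ... * n
    }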