Tags: c++, fibonacci, gmp

Why is one of these dynamic programming implementations of the Fibonacci sequence faster than the other?


I've been researching and benchmarking various Fibonacci algorithms recently for my own amusement, and more or less by accident I came up with an alternate version of the classic O(n)-time, O(1)-space dynamic programming implementation.

Consider the following two functions:

BigInt fib_dp_classic(int n) {
  if (n == 0) {
    return 0;
  }

  BigInt x = 0, y = 1, z;
  for (int i = 2; i <= n; ++i) {
    z = x + y;
    x = y;
    y = z;
  }

  return y;
}

and

BigInt fib_dp_mod(int n) {
  BigInt x = 0, y = 1, z = 1;
  for (int i = 0; i < n; ++i) {
    switch (i % 3) {
      case 0:
        y = x + z;
        break;
      case 1:
        z = x + y;
        break;
      case 2:
        x = y + z;
        break;
    }
  }

  switch (n % 3) {
    case 0:
      return x;
      break;
    case 1:
      return y;
      break;
    case 2:
      return z;
      break;
  }
}

On my machine, calculating the millionth Fibonacci number takes 6.55 seconds with fib_dp_classic and 2.83 seconds with fib_dp_mod, and even turning on -O3 doesn't change this much. I don't really have any good ideas as to why the mod version is faster. Is it because the extra store instructions in the classic version are more expensive than the mod operation in the second? My understanding is that the compiler should be keeping all three variables in registers in both versions, and that computing the mod is actually fairly expensive; is this not the case?

In fact, I just put both of these through Compiler Explorer, and both use only registers once optimizations are turned on. Granted, that was with plain ints, not the GMP-based bigints I was actually using for my benchmark. Is there some weird GMP implementation detail that might be the cause here?
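
For context, the numbers above come from a minimal harness along these lines (a sketch of the setup; BigInt is aliased to gmpxx's mpz_class and the timing is done with std::chrono):

#include <chrono>
#include <iostream>
#include <gmpxx.h>

using BigInt = mpz_class;  // GMP's C++ big-integer wrapper

BigInt fib_dp_classic(int n);  // as above
BigInt fib_dp_mod(int n);      // as above

int main() {
  const int n = 1000000;

  auto t0 = std::chrono::steady_clock::now();
  BigInt a = fib_dp_classic(n);
  auto t1 = std::chrono::steady_clock::now();
  BigInt b = fib_dp_mod(n);
  auto t2 = std::chrono::steady_clock::now();

  std::cout << "classic: " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
            << "mod:     " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
            << "same result: " << (a == b) << "\n";  // sanity check
}

(Compiled with -O3 and linked with -lgmpxx -lgmp.)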

Update: I even strace'd both to see if malloc() might be the culprit and fib_dp_classic uses 130 syscalls (for n=1000000) while fib_dp_mod uses 133. So still no real clues...

Update 2: Yes, the buffer copies are the culprit (as geza pointed out), and I was dumb for not realizing it. Here are two alternate versions and their benchmark results:

BigInt fib_dp_move(int n) {
  if (n == 0) {
    return 0;
  }

  BigInt x = 0, y = 1, z;
  for (int i = 2; i <= n; ++i) {
    z = std::move(x) + y;
    x = std::move(y);
    y = std::move(z);
  }

  return y;
}

This runs in 2.84 seconds, so just about equivalent to the mod version since it eliminates the unnecessary copies.

BigInt fib_dp_swap(int n) {
  if (n == 0) {
    return 0;
  }

  BigInt x = 0, y = 1, z;
  for (int i = 2; i <= n; ++i) {
    z = x + y;
    swap(x, y);
    swap(y, z);
  }

  return y;
}

This (from geza) also runs in 2.84 seconds, so again about equivalent to the mod version since it eliminates the copies in basically the same way, just calling swap() instead of using move semantics.


Solution

  • In this case, there is no point comparing the GMP version with the simple int version. Fib(1000000) is a ~86KB number (roughly 1,000,000 · log2(φ) ≈ 694,000 bits), so it is much slower to work with than a simple int. For ints, a copy can be a basically free operation in some circumstances, while for GMP numbers it involves copying a ~86KB buffer.

    (Note that the copy won't always be 86KB. At the beginning it is ~0KB, and it grows to 86KB as the routine progresses. But the routine also gets slower as the numbers grow, so the majority of the time is spent when the numbers are big.)
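
    (As a quick sanity check on the ~86KB figure, GMP can report the size directly. The snippet below is only for checking: it assumes a 64-bit machine and uses GMP's built-in mpz_fib_ui rather than either routine from the question.)

    #include <gmp.h>
    #include <gmpxx.h>
    #include <iostream>

    int main() {
      mpz_class f;
      mpz_fib_ui(f.get_mpz_t(), 1000000);  // GMP's built-in Fibonacci
      std::cout << mpz_sizeinbase(f.get_mpz_t(), 2) << " bits, "
                << mpz_size(f.get_mpz_t()) * sizeof(mp_limb_t) << " bytes of limb data\n";
      // Roughly 694,000 bits, i.e. ~86-87KB of limb data on a 64-bit machine.
    }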

    Assuming a quality BigInt implementation, we have these operation counts in each iteration:

    • fib_dp_classic: 1 add, 2 copies
    • fib_dp_mod: 1 add

    As you can see, the classic version does 2 extra copies per iteration (note that in a quality implementation, x = y + z doesn't involve a copy). And a copy costs the same order of magnitude as the add (an add may be only 2x-3x slower than a copy), so the per-iteration cost goes from 1 add to 1 add plus 2 copies. This explains the ~2.3x slowdown of the classic routine.

    Note that if BigInt were a reference-counted implementation, then x = y could be a basically free operation: it doesn't need a copy, just an increment of a counter. In that case, the classic routine would have the same speed as the mod one.
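
    For illustration only, a hypothetical reference-counted (copy-on-write) wrapper could look like the sketch below; gmpxx's mpz_class does not work this way, this just shows why x = y would then be O(1):

    #include <memory>
    #include <gmpxx.h>

    // Hypothetical copy-on-write big integer; NOT how gmpxx's mpz_class behaves.
    class RcBigInt {
      std::shared_ptr<mpz_class> p_;

      // Make sure we own the buffer before mutating it; the real copy is
      // deferred until a shared value is actually written to.
      void detach() {
        if (p_.use_count() > 1) p_ = std::make_shared<mpz_class>(*p_);
      }

     public:
      RcBigInt(long v = 0) : p_(std::make_shared<mpz_class>(v)) {}
      RcBigInt(const RcBigInt&) = default;

      // x = y just shares the buffer and bumps the reference count: O(1),
      // no matter how many limbs the number has.
      RcBigInt& operator=(const RcBigInt&) = default;

      RcBigInt& operator+=(const RcBigInt& o) {
        detach();       // don't clobber a buffer someone else still sees
        *p_ += *o.p_;   // the only real work: one big-number add
        return *this;
      }

      friend RcBigInt operator+(const RcBigInt& a, const RcBigInt& b) {
        RcBigInt r;
        *r.p_ = *a.p_ + *b.p_;  // compute into a fresh, uniquely owned buffer
        return r;
      }
    };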

    Last note: you can presumably speed up the classic version if a non-copying swap operation is available:

    BigInt fib_dp_swap(int n) {
      if (n == 0) {
        return 0;
      }
    
      BigInt x = 0, y = 1, z;
      for (int i = 2; i <= n; ++i) {
        z = x + y;
        x.swap(y); // Note the 2 swaps here
        y.swap(z);
      }
    
      return y;
    }
    

    With gmpxx, this routine runs in the same time as the mod version, and it is much simpler.

This is because a swap can be much cheaper than a copy: swap just needs to exchange the pointers inside BigInt, while a copy needs a ~86KB memory copy.
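
Roughly, the difference comes down to this (a simplified sketch of a GMP-style representation, not GMP's actual code):

#include <algorithm>
#include <utility>

// Simplified sketch of a GMP-style big-integer layout: a heap buffer of
// limbs plus a couple of bookkeeping fields.
struct BigIntRep {
  unsigned long* limbs;  // the ~86KB digit buffer for fib(1000000)
  int alloc;             // limbs allocated
  int size;              // limbs currently in use

  // swap: exchange three words; the cost is independent of the number's size.
  void swap(BigIntRep& other) {
    std::swap(limbs, other.limbs);
    std::swap(alloc, other.alloc);
    std::swap(size, other.size);
  }

  // copy: must duplicate the whole limb buffer, i.e. O(size) -- ~86KB of
  // memcpy near the end of the fib(1000000) run. (Real code would also grow
  // `limbs` first if alloc < other.size.)
  void copy_from(const BigIntRep& other) {
    std::copy(other.limbs, other.limbs + other.size, limbs);
    size = other.size;
  }
};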