Return by reference is consistently faster than return by value and an explicit stack with recursion

The code below consistently produces results around this, always in roughly the same order: time - passed: 30 returned: 141 stacked: 212 for: 31. My interpretation is:

The pass-by-reference function is just as fast as the for-loop function. Is this because the scope doesn't have to be preserved or reallocated on every call, so it just reuses the same memory while iterating down the depth, just like the for loop? And is it very slightly quicker because the function itself doesn't have to allocate the result? Meanwhile, does the return-by-value function preserve the stack frame of every call before adding them all up once the deepest call returns? And is the stack function the slowest because it has to constantly free and re-request the same memory?

At higher max depths, passed is a bit slower than for, and returned is a bit slower than stacked. But passed/for stay at least around 4x faster than stacked/returned.

Are my guesses for the performance difference correct?

#include <iostream>
#include <array>
#include <intrin.h>
#include <vector>

const std::int32_t MAX_DEPTH = 20;
const int ITERATIONS = 100000000;
std::int32_t returned_calls = 0;
std::int32_t passed_calls = 0;
std::int32_t stacked_calls = 0;
std::int32_t for_calls = 0;

void passed_func(std::int32_t depth, std::int32_t& result) {
    depth += 1;
    result += depth;
    if (depth == MAX_DEPTH) {
        return;
    }
    passed_func(depth, result);
}

std::int32_t returned_func(std::int32_t depth) {
    depth += 1;
    if (depth == MAX_DEPTH) {
        return depth;
    }
    return returned_func(depth) + depth;
}

std::int32_t stacked_func() {
    std::int32_t result = 0;
    std::vector<std::int32_t> stack;
    stack.emplace_back(1);
    while (true) {
        std::int32_t depth = stack.back();
        result += depth;
        if (depth == MAX_DEPTH) {
            return result;
        }
        stack.emplace_back(depth + 1);
    }
}

std::int32_t for_func() {
    std::int32_t result = 0;
    for (std::int32_t depth = 1; depth <= MAX_DEPTH; depth++) {
        result += depth;
    }
    return result;
}

int main() {
    std::uint64_t passed_time = 0;
    std::uint64_t returned_time = 0;
    std::uint64_t stacked_time = 0;
    std::uint64_t for_time = 0;
    for (int i = 0; i < ITERATIONS; i++) {
        std::int32_t passed_result = 0;
        std::int32_t returned_result = 0;
        std::int32_t stacked_result = 0;
        std::int32_t for_result = 0;
        std::uint64_t t1 = __rdtsc();
        passed_func(0, passed_result);
        std::uint64_t t2 = __rdtsc();
        returned_result = returned_func(0);
        std::uint64_t t3 = __rdtsc();
        stacked_result = stacked_func();
        std::uint64_t t4 = __rdtsc();
        for_result = for_func();
        std::uint64_t t5 = __rdtsc();
        if (passed_result != returned_result || stacked_result != returned_result || for_result != returned_result) {
            return 1;
        }
        passed_time += t2 - t1;
        returned_time += t3 - t2;
        stacked_time += t4 - t3;
        for_time += t5 - t4;
    }
    if (returned_calls != passed_calls || stacked_calls != returned_calls || for_calls != returned_calls) {
        return 1;
    }
    std::cout
        << "time - passed: " << passed_time / ITERATIONS
        << " returned: " << returned_time / ITERATIONS
        << " stacked: " << stacked_time / ITERATIONS
        << " for: " << for_time / ITERATIONS
        << std::endl;
    return 0;
}


  • Interesting observation. I was able to reproduce your result approximately on my machine.

    time - passed: 30 returned: 83 stacked: 197 for: 30

    One observation I made was that stacked can be accelerated if one declares std::vector<std::int32_t> stack; as a global variable and replaces the definition of the vector in the function body with stack.clear(). Since clear() keeps the vector's capacity, this avoids 99,999,999 heap allocations (all but the first). After that replacement I got on my machine

    time - passed: 29 returned: 84 stacked: 77 for: 30

    But that's just by the way.

    I would say your summary is basically correct, but it has little to do with whether you return by value or by reference. What you are describing is called tail call optimization. If the recursive call is the last action of the function, and after the call returns the calling function has nothing left to do except return its result, the compiler can replace the recursive call with a simple jump back to the beginning of the function, i.e. it behaves almost like the for loop.
    With a small modification one can also make the returned_func version tail-recursive:

    std::int32_t returned_func(std::int32_t depth, std::int32_t accumulator = 0) {
        depth += 1;
        accumulator += depth;
        if (depth == MAX_DEPTH) {
            return accumulator;
        }
        return returned_func(depth, accumulator);
    }

    and voilà, the returned_func version is almost as fast as the for version:

    time - passed: 29 returned: 36 stacked: 80 for: 31

    But so far everything I have said is an educated guess. If we want to know for sure, we have to look at the resulting assembly code. Here is a comparison of your "returned" variant with the "passed" variant in Compiler Explorer, using msvc 19.latest with /O2: https://godbolt.org/z/9avEK6soq We can see that in the "passed" version there is no recursive function call; it is replaced by a conditional jump, 'jne SHORT $LL9@passed_fun', while in the "returned" version we still have the recursive call 'call int returned_func(int)'.

    With the modified returned_func https://godbolt.org/z/Mn9P4Yhbc both versions result in very similar assembly code.

    edit: I spent so long writing my answer that I missed ALX23z's answer, which came to the same conclusion.