Why is my Julia code is slow compare to C++?

I wrote the next code and measured its execution time.

function main()

    function Fib(n::Int)
        if (n == 0 || n == 1)
            return n
        end

        return(Fib(n - 2) + Fib(n - 1))
    end

    res = Fib(46)

    println(res)
end

@time main()

1836311903 123.625395 seconds (5.72 M allocations: 88.103 MiB, 0.02% gc time, 0.02% compilation time)2% compilation time)

I use VS Code and run the program with F1 -> Julia: Execute file in REPL. Isn't it slow for Julia compared to the same C++ code, which takes only about 16.5s? What am I doing wrong?

I heard that sometimes Julia could be slower than C++, but I hadn't expected such a big gap.

Solution

TL;DR: Julia is much slower in this benchmark because Fib is not a top-level function.

This benchmark overall measures the cost of function calls. Julia function calls tend to be more expensive than C++ one by design. Indeed, as stated in the Julia documentation:

Every function in Julia is a generic function. A generic function is conceptually a single function, but consists of many definitions, or methods. The methods of a generic function are stored in a method table.
Given the call f(x,y), the following steps are performed: first, the method table to use is accessed as typeof(f).name.mt. Second, an argument tuple type is formed, Tuple{typeof(f), typeof(x), typeof(y)}. [...]. This tuple type is looked up in the method table.
This dispatch process is performed by jl_apply_generic.

I tried to profile your code with a low-level profiler and I can confirm that most of the time is spent in ijl_apply_generic which is the generic dispatch mechanism. To be more precise, about 60% of the time is spent in that, 5% in integer boxing, and 5% in few other Julia internals like the stack management. Put it shortly, most of the time is simply pure overhead.

That being said, Julia can apparently optimizes this dispatch mechanism if Fib is a top-level function. I guess it consider Fib as a closure internally in your code and it is harder to optimize closure because of their fundamentally more dynamic behavior. What this means is that you just need to move the Fib function outside the main function so to make the function much faster.

Here is the corrected code:

function Fib(n::Int)
    if (n == 0 || n == 1)
        return n
    end

    return(Fib(n - 2) + Fib(n - 1))
end

function main()
    res = Fib(46)
    println(res)
end

@time main()

Here are results on my machine for Fib(42):

Initial Julia code:     12.859 seconds
Fixed Julia code:        1.100 seconds     <-----

Unoptimized C++ code:    1.265 seconds
Optimized C++ code:      0.593 seconds

Note that the Julia version is 1.9.4 and the C++ code has been compiled with Clang 16.0.6 (using the flags -O3 -march=native for the optimized version).

The Julia version is still not as fast as the C++ implementation, but "only" about twice slower. This is because Julia does not implement the Tail Call Optimization (TCO) while C++ compilers (like Clang) generally does. This is apparently because of the multiple dispatch mechanism discussed above (see this discussion).

In the specific case of Fibonacci, note that this implementation is fundamentally inefficient! Indeed, one can simply use an iterative implementation without recursive calls and get much better performance (even in C++). In general, (non-inlined) function calls are expensive whatever the target language so they should be avoided in the first place.