Tags: python, julia, game-theory

Problem with memory allocation in Julia code


I used a function in Python/Numpy to solve a problem in combinatorial game theory.

import numpy as np
from time import time

def problem(c):
    start = time()
    N = np.array([0, 0])
    U = np.arange(c)
    
    for _ in U:
        bits = np.bitwise_xor(N[:-1], N[-2::-1])
        N = np.append(N, np.setdiff1d(U, bits).min())

    return len(*np.where(N==0)), time()-start 

problem(10000)

Then I wrote it in Julia because I thought it'd be faster due to Julia using just-in-time compilation.

function problem(c)
    N = [0]
    U = Vector(0:c)
    
    for _ in U
        elems = N[1:length(N)-1]
        bits = elems .⊻ reverse(elems)
        push!(N, minimum(setdiff(U, bits))) 
    end
    
    return sum(N .== 0)
end

@time problem(10000)

But the second version was much slower: for c = 10000, the Python version takes 2.5 s on a Core i5 processor, while the Julia version takes 4.5 s. Since the NumPy operations are implemented in C, I'm wondering whether Python really is faster here, or whether my Julia function is simply written inefficiently.

The Julia implementation allocates a lot of memory. How can I reduce the number of allocations to improve its performance?


Solution

  • The original code can be rewritten in the following way:

    function problem2(c)
        N = zeros(Int, c+2)          # preallocate the whole sequence once
        notseen = falses(c+1)        # reusable scratch buffer (a BitVector)
    
        for lN in 1:c+1
            notseen .= true          # reset the buffer in place, no new allocation
            @inbounds for i in 1:lN-1
                b = N[i] ⊻ N[lN-i]   # XOR of a symmetric pair of earlier values
                b <= c && (notseen[b+1] = false)   # mark b as seen (only 0:c tracked)
            end
            idx = findfirst(notseen)               # smallest value not produced
            isnothing(idx) || (N[lN+1] = idx-1)    # that value is the next element
        end
        return count(==(0), N)       # number of zeros in the sequence
    end
    

    First, check that both functions produce the same result:

    julia> problem(10000), problem2(10000)
    (1475, 1475)
    

    (I have also checked that the generated N vectors are identical.)

    Now let us benchmark both functions:

    julia> using BenchmarkTools
    
    julia> @btime problem(10000)
      4.938 s (163884 allocations: 3.25 GiB)
    1475
    
    julia> @btime problem2(10000)
      76.275 ms (4 allocations: 79.59 KiB)
    1475
    

    So it turns out to be over 60x faster.

    What I did to improve the performance was to avoid allocations, which is easy and efficient to do in Julia. If any part of the code is not clear, please comment. Note that I concentrated on showing how to improve the performance of the Julia code rather than on replicating the Python code exactly, since, as was noted in the comments under the original post, cross-language performance comparisons are tricky. I think it is better to focus in this discussion on how to make the Julia code fast.
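
    To make the "avoid allocations" advice concrete, below is a minimal sketch of the pattern that problem2 relies on (the names and sizes are only illustrative, not part of the original problem): allocate a scratch buffer once, outside the loop, and refill it with an in-place broadcast (.=), instead of building a fresh array on every iteration.

    # Illustrative sketch: an allocating loop versus a preallocated,
    # in-place version (the same idea as `notseen .= true` above).
    function work_allocating(x, n)
        s = 0.0
        for _ in 1:n
            tmp = x .+ 1        # allocates a new vector on every iteration
            s += sum(tmp)
        end
        return s
    end

    function work_preallocated(x, n)
        s = 0.0
        tmp = similar(x)        # allocate the buffer once
        for _ in 1:n
            tmp .= x .+ 1       # fused in-place broadcast: reuses the buffer
            s += sum(tmp)
        end
        return s
    end

    # using BenchmarkTools
    # x = rand(10_000)
    # @btime work_allocating($x, 100)      # expect one allocation per iteration
    # @btime work_preallocated($x, 100)    # expect a single allocation for tmp

    Benchmarking both should show the allocating version paying for a new temporary array on every iteration, which is the same kind of overhead that the slicing, reverse, setdiff, and push! calls cause in the original problem.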


    EDIT

    Indeed, changing notseen to a Vector{Bool} and removing the b <= c check (a relation that holds mathematically for these values of c) gives even better speed:

    julia> function problem3(c)
               N = zeros(Int, c+2)
               notseen = Vector{Bool}(undef, c+1)
    
               for lN in 1:c+1
                   notseen .= true
                   @inbounds for i in 1:lN-1
                       b = N[i] ⊻ N[lN-i]
                       notseen[b+1] = false
                   end
                   idx = findfirst(notseen)
                   isnothing(idx) || (N[lN+1] = idx-1)
               end
               return count(==(0), N)
           end
    problem3 (generic function with 1 method)
    
    julia> @btime problem3(10000)
      20.714 ms (3 allocations: 88.17 KiB)
    1475
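
    As a side note on why the Vector{Bool} change helps: falses(c+1) returns a BitVector, which packs eight flags into each byte, so writing a single element requires bit-level masking, while writing an element of a Vector{Bool} is a plain byte store. The sketch below (sizes and names are mine, purely illustrative) can be used to check this effect in isolation on your own machine:

    using BenchmarkTools

    # Illustrative micro-benchmark: clear many individual flags in a packed
    # BitVector versus an unpacked Vector{Bool}.
    function clear_flags!(flags, idxs)
        @inbounds for i in idxs
            flags[i] = false
        end
        return flags
    end

    n = 10_001
    idxs = rand(1:n, 100_000)
    bitvec  = trues(n)           # BitVector: 1 bit per flag
    bytevec = fill(true, n)      # Vector{Bool}: 1 byte per flag
    @btime clear_flags!($bitvec, $idxs)
    @btime clear_flags!($bytevec, $idxs)

    The other change in problem3, dropping the b <= c test, also removes a branch from the hot inner loop.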