Search code examples
arraysperformancejuliaallocation

Why do allocations occur during broadcasting assignment to a preallocated array?


I am having trouble understanding why allocations are occurring during broadcast assignment, where all operations involved are themselves broadcasted. I am using Julia v1.10.

using BenchmarkTools
using Random

let
    k = 100
    n = 10000
    a = zeros(Int, n)
    b = zeros(Int, n)
    c = falses(n)
    d = falses(n)

    @btime rand!($a, 1:($k))
    @btime rand!($b, 1:($k))

    @btime $c .= $a .<= $b
    @btime $d .= $c .& ($b .!= 0)

    nothing
end
  27.792 μs (0 allocations: 0 bytes)
  28.541 μs (0 allocations: 0 bytes)
  2.560 μs (1 allocation: 4.19 KiB)
  7.552 μs (1 allocation: 4.19 KiB)

I would expect every line to be non-allocating; they're all writing to a pre-allocated array using broadcasted operations. I assumed this was equivalent to a for loop assigning the results to the destination. But c and d allocate to do the broadcasted assignment.

If I replace a .<= b with a <= b, c's allocation goes away:

  27.833 μs (0 allocations: 0 bytes)
  28.500 μs (0 allocations: 0 bytes)
  19.809 ns (0 allocations: 0 bytes)
  7.542 μs (1 allocation: 4.19 KiB)

But I cannot figure out how to make d's allocation go away without manually writing the loop:

@btime for i in eachindex($a)  # they all have the same indices; this is ok
    $d[i] = $c[i] && ($b[i] != 0)
end
  21.875 μs (0 allocations: 0 bytes)

Where am I going wrong in the broadcasted version that is leading to allocations?


Solution

  • The short answer is that BitArrays are packed and the version of julia you're running uses a specific optimization that will always allocate a 4096 byte array when materializing the broadcast into a BitVector with 256 elements or more.

    Array indexing on a 64 bit system generally works with 8 byte words. On this system, a julia array of Bool allocates up to 64 bits (realistically 1 byte/8 bits per boolean) for every boolean value even though a single bit can represent the two possible states.

    In a BitArray the booleans are represented by single bits, there is a nice interface on top that makes it look easy to set individual values but really the whole word needs to be updated at once. This makes performing vectorized operations efficiently non-trivial, see e.g. this issue before the first implementation of broadcasting for the bitarray operations. That implementation has been more or less the same until very recently. The 256 element cut off was added in 1.3. Now a non-allocating alternative was merged in November.

    with n = 100000:

    @btime for i in eachindex($a)
        $d[i] = $c[i] && ($b != 0)
    end
      259.271 μs (0 allocations: 0 bytes)
    
    @btime $d .= $c .& ($b .!= 0)
    161.250 μs (1 allocation: 4.19 KiB)
    

    Monkey patching in the new algorithm

    @eval Base.Broadcast @inline function copyto!(dest::BitVector, bc::Broadcasted{Nothing})
        axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
        ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
        destc = dest.chunks
        bcp = preprocess(dest, bc)
        length(bcp) <= 0 && return dest
        len = Base.num_bit_chunks(Int(length(bcp)))
        @inbounds for i = 0:(len - 2)
            z = UInt64(0)
            for j = 0:63
               z |= UInt64(bcp[i*64 + j + 1]::Bool) << (j & 63)
            end
            destc[i + 1] = z
        end
        @inbounds let i = len - 1
            z = UInt64(0)
            for j = 0:((length(bcp) - 1) & 63)
                 z |= UInt64(bcp[i*64 + j + 1]::Bool) << (j & 63)
            end
            destc[i + 1] = z
        end
        return dest
    end
    

    Now broadcasting assign doesn't allocate and has even better performance:

    @btime $d .= $c .& ($b .!= 0)
    140.260 μs (0 allocations: 0 bytes)