Search code examples
pythonllvmnumba

How to use native popcount with numba


I am using numba 0.57.1 and I would like to exploit the native CPU popcount in my code. My existing code is too slow as I need to run it hundreds of millions of times. Here is a MWE:

import numba as nb

@nb.njit(nb.uint64(nb.uint64))
def popcount(x): 
      b=0
      while(x > 0):
          x &= x - nb.uint64(1)   
          b+=1
      return b
    
    
print(popcount(43))

The current speed is:

%timeit popcount(255)
148 ns ± 0.369 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

I believe that it should be possible to use the native CPU popcount (_mm_popcnt_u64 as a C intrinsic) instruction using numba intrinsics but this area is new to me. The llvm intrinsic I need to use is, I think, ctpop. Assuming this is right, how can I do this?


Solution

  • Try:

    import numba as nb
    from numba import types
    from numba.cpython import mathimpl
    from numba.extending import intrinsic
    
    
    @intrinsic
    def popcnt(typingctx, src):
        sig = types.uint64(types.uint64)
    
        def codegen(context, builder, signature, args):
            return mathimpl.call_fp_intrinsic(builder, "llvm.ctpop.i64", args)
    
        return sig, codegen
    
    
    @nb.njit(nb.uint64(nb.uint64))
    def popcount(x):
        return popcnt(x)
    
    
    print(popcount(43))
    

    Prints:

    4