c assembly optimization x86 hammingweight

Bit popcount for large buffer, with Core 2 CPU (SSSE3)


I'm looking for the fastest way to popcount a large buffer of 512 or more bytes. I can guarantee any required alignment, and the buffer size is always a power of 2. The buffer corresponds to block allocations, so typically the bits are either all set, none set, or mostly set toward the "left" of the buffer, with occasional holes.

Some solutions I've considered are:

I'm interested in the fastest solution. It must work on 32-bit x86 CPUs of the Core 2 generation or more recent. SSE and other SIMD approaches are of great interest. I'll be testing on the following quad-core CPU:

matt@stanley:~/anacrolix/public/stackoverflow$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping        : 11
cpu MHz         : 1600.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow vnmi flexpriority
bogomips        : 4800.21
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

Solution

  • Below I outline the best C and assembly functions I found for computing the population count (Hamming weight) of large buffers.

    The fastest assembly function is ssse3_popcount3, described here. It requires SSSE3, available on Intel Core 2 and later, and on AMD processors arriving in 2011. It uses SIMD instructions to popcount in 16-byte chunks and unrolls the loop four iterations at a time.
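
To make the 16-byte-chunk idea concrete, here is a minimal C sketch using SSSE3 intrinsics. It is not the ssse3_popcount3 routine itself: the name popcount_ssse3_sketch is hypothetical, the loop is not unrolled, and it assumes an x86 target, a GCC- or clang-compatible compiler, a 16-byte-aligned buffer, and a length that is a multiple of 16. It uses the common nibble-lookup technique, where pshufb (_mm_shuffle_epi8) acts as a 16-entry table of bit counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics, including _mm_shuffle_epi8 */

/* Hypothetical helper, not the answer's ssse3_popcount3.
 * Assumes buf is 16-byte aligned and len is a multiple of 16. */
#if defined(__GNUC__)
__attribute__((target("ssse3")))  /* lets this compile without -mssse3 */
#endif
size_t popcount_ssse3_sketch(const void *buf, size_t len)
{
    /* lut[i] = number of set bits in the 4-bit value i */
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i low_nibble = _mm_set1_epi8(0x0f);
    const __m128i zero = _mm_setzero_si128();
    const __m128i *p = (const __m128i *)buf;
    __m128i total = zero;  /* two 64-bit partial sums */
    uint64_t lanes[2];
    size_t i;

    assert(len % 16 == 0 && (uintptr_t)buf % 16 == 0);

    for (i = 0; i < len / 16; i++) {
        __m128i v  = _mm_load_si128(p + i);
        /* Split every byte into its low and high nibble. */
        __m128i lo = _mm_and_si128(v, low_nibble);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_nibble);
        /* pshufb looks up the bit count of each nibble in parallel. */
        __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                                   _mm_shuffle_epi8(lut, hi));
        /* psadbw against zero sums each group of 8 byte counts
         * into one 64-bit lane, so nothing can overflow. */
        total = _mm_add_epi64(total, _mm_sad_epu8(cnt, zero));
    }

    _mm_storeu_si128((__m128i *)lanes, total);
    return (size_t)(lanes[0] + lanes[1]);
}
```

For a 512-byte buffer with every bit set, this returns 4096. A tuned version would likely accumulate the per-byte counts over several iterations before the psadbw step, which is where unrolling pays off.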

    The fastest C function is popcount_24words, described here. It uses the bit-slicing algorithm. Notably, I found that clang could auto-vectorize it, generating appropriate vector instructions, which gave an impressive performance increase. Even without that, the algorithm is extremely fast.
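
As a rough illustration of the bit-slicing approach, the following C sketch counts 24 32-bit words per outer iteration by merging three words at a time into packed partial sums. It is a reconstruction of the general technique, not the popcount_24words source: the names popcount_24words_sketch and popcount_word are hypothetical, and the tail handling for leftover words is an assumption added so that any word count works.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain SWAR popcount of one word; used only for leftover words. */
static uint32_t popcount_word(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0f0f0f0fu;
    return (x * 0x01010101u) >> 24;
}

/* Hypothetical sketch of a 24-words-per-block bit-sliced popcount. */
size_t popcount_24words_sketch(const uint32_t *data, size_t nwords)
{
    size_t total = 0;
    size_t i = 0, j;

    for (; i + 24 <= nwords; i += 24) {
        uint32_t acc = 0;  /* four byte-wide counters, each at most 192 */

        for (j = 0; j < 24; j += 3) {
            uint32_t c1 = data[i + j];
            uint32_t c2 = data[i + j + 1];
            uint32_t w3 = data[i + j + 2];
            /* Split the third word into its even and odd bits. */
            uint32_t even = w3 & 0x55555555u;
            uint32_t odd  = (w3 >> 1) & 0x55555555u;

            /* 2-bit fields: pair counts (0..2) plus one bit of w3 (0..3). */
            c1 = c1 - ((c1 >> 1) & 0x55555555u) + even;
            c2 = c2 - ((c2 >> 1) & 0x55555555u) + odd;
            /* 4-bit fields: 0..6 each, so their sum (0..12) still fits. */
            c1 = (c1 & 0x33333333u) + ((c1 >> 2) & 0x33333333u);
            c2 = (c2 & 0x33333333u) + ((c2 >> 2) & 0x33333333u);
            c1 += c2;
            /* 8-bit fields: this triple adds at most 24 per byte. */
            acc += (c1 & 0x0f0f0f0fu) + ((c1 >> 4) & 0x0f0f0f0fu);
        }

        /* Fold the four byte counters into one sum (at most 768). */
        acc = (acc & 0x00ff00ffu) + ((acc >> 8) & 0x00ff00ffu);
        total += (acc & 0xffffu) + (acc >> 16);
    }

    for (; i < nwords; i++)
        total += popcount_word(data[i]);

    return total;
}
```

The point of the grouping is that the expensive widening steps are shared: three words go through one 2-bit-to-4-bit merge, and eight triples share a single fold from bytes to the final sum. The inner loop is straight-line integer arithmetic on independent words, which is the kind of code a compiler can plausibly vectorize.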