Tags: parallel-processing, bit-manipulation, hardware-acceleration

On how many bits, at most, can I efficiently perform bitwise operations?


Given contemporary hardware, I would like to know the maximum size of bit array on which I can efficiently (e.g. in one CPU cycle) perform bitwise operations. For example, for a 64-bit processor, I would assume the answer is simply 64. Is this true? How much more can I get on a GPU, or perhaps some exotic hardware? If I were to build an ASIC doing simply bitwise OR, how far could I get?


Solution

  • An RX 550 at 1325 MHz can do bitwise operations on 32-bit integers at a rate of 893 giga-integers per second, i.e. about 28.5 terabits per second. Dividing this by the clock frequency gives roughly

    21581 bits per cycle (there are only 512 cores in this GPU; each doing 32-bit integer compute would mean 32*512 = 16384 bits per cycle, but there are floating-point units too, which must have been used to emulate integer operations to reach 21.6 kbit per cycle; maybe some other unknown units are working too, such as 64-bit cores helping with bitwise operations).

    But of course the latency is always higher than on a CPU, and if the data has to be moved over the PCIe bridge, throughput drops to about 4 GB/s, which means 32 gigabits per second. That is slower than a single CPU core. What matters is how much compute is done per bit: if it is just one operation, sending the data to the GPU wouldn't help much; if it is 50+ operations per bit, it is worth sending it to a GPU or FPGA. (The arithmetic behind these throughput numbers is worked through in the sketch after the kernel below.)

    Test kernel (OpenCL):

    // Each work item loads one int16 vector (16 x 32-bit ints = 512 bits)
    // and runs a chain of bitwise operations on it before writing it back.
    __kernel void bitwise(__global int16 * data)
    { 
        int16 pData = data[get_global_id(0)];
        int16 pData2 = pData & 1234123;                // bitwise AND with a constant
        for(int i = 0; i < 25; i++)
        {    
            pData  |= (pData  ^ 55) & (pData  ^ 120);  // XOR, AND, OR per component
            pData2 |= (pData2 ^ 55) & (pData2 ^ 120);
        }
        data[get_global_id(0)] = pData & pData2;       // combine and store the result
    }
    

    The test buffer is an array of 128M integers (512 MB). A minimal host-side sketch of how such a kernel could be launched is also included below.
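
    To make the figures above concrete, here is a small back-of-the-envelope program (plain C, not part of the original test) that recomputes the ~28.5 Tbit/s, ~21.6 kbit/cycle and 32 Gbit/s numbers. The inputs are the values quoted above (893 giga-integers/s, 1325 MHz, 4 GB/s PCIe); nothing in it is measured, and small rounding differences from the quoted 21581 bits/cycle are expected.

    #include <stdio.h>

    int main(void)
    {
        /* Inputs quoted in the answer above (assumed, not measured here). */
        const double ints_per_sec = 893e9;   /* integer bitwise ops per second on the RX550 */
        const double clock_hz     = 1325e6;  /* GPU clock frequency                          */
        const double bits_per_int = 32.0;    /* width of one integer lane                    */
        const double pcie_bytes_s = 4e9;     /* ~4 GB/s effective PCIe transfer rate         */

        double bits_per_sec   = ints_per_sec * bits_per_int;   /* ~28.6 Tbit/s     */
        double bits_per_cycle = bits_per_sec / clock_hz;        /* ~21.6 kbit/cycle */
        double pcie_bits_s    = pcie_bytes_s * 8.0;             /* ~32 Gbit/s       */

        printf("GPU bitwise throughput : %.1f Tbit/s\n", bits_per_sec / 1e12);
        printf("bits per clock cycle   : %.0f\n", bits_per_cycle);
        printf("PCIe-limited throughput: %.0f Gbit/s\n", pcie_bits_s / 1e9);
        return 0;
    }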
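
    For reference, here is a minimal host-side sketch (plain C, OpenCL 1.2) of how a kernel like the one above could be launched over a 128M-integer buffer. This is not the original test harness: it assumes the first GPU on the first platform, skips error checking, and omits the timing code a real measurement would need.

    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Same kernel as above, embedded as a source string. */
    static const char *src =
    "__kernel void bitwise(__global int16 *data)        \n"
    "{                                                  \n"
    "    int16 pData  = data[get_global_id(0)];         \n"
    "    int16 pData2 = pData & 1234123;                \n"
    "    for(int i = 0; i < 25; i++)                    \n"
    "    {                                              \n"
    "        pData  |= (pData  ^ 55) & (pData  ^ 120);  \n"
    "        pData2 |= (pData2 ^ 55) & (pData2 ^ 120);  \n"
    "    }                                              \n"
    "    data[get_global_id(0)] = pData & pData2;       \n"
    "}                                                  \n";

    int main(void)
    {
        const size_t n_ints  = 128u * 1024u * 1024u;  /* 128M integers = 512 MB         */
        size_t       n_items = n_ints / 16;           /* one work item per int16 vector */

        cl_platform_id plat; clGetPlatformIDs(1, &plat, NULL);
        cl_device_id   dev;  clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Fill a host buffer and copy it to the device. */
        cl_int *host = malloc(n_ints * sizeof(cl_int));
        for (size_t i = 0; i < n_ints; i++) host[i] = (cl_int)i;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    n_ints * sizeof(cl_int), host, NULL);

        /* Build the program and run one work item per int16 element. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "bitwise", NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n_items, NULL, 0, NULL, NULL);
        clFinish(q);

        /* Read the result back over PCIe (the ~4 GB/s bottleneck discussed above). */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n_ints * sizeof(cl_int), host, 0, NULL, NULL);
        printf("data[0] after kernel: %d\n", host[0]);

        free(host);
        return 0;
    }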