I am trying to optimize the QTC video codec to work on the Raspberry Pi with a decent performance. One important bottleneck is a 32 bits integer division done in the range decoder which takes into account for 18% of the decoding time. Since the device's ARM processor apparently lacks an integer division instruction, I think one could easily optimize this. The division has to be accurate.
Both the dividend and the divisor in that particular division are different each call, but it is known that the divisor is always smaller than 65536. I thought about building a lookup table of inverse divisor values. Using that table I could use a multiplication instead of the division. The lookup table is going to have a size of 256 kibibytes.
If you want to use a magic multiplication + LUT, here's some code.
Simple tester testing random divisors i. Did not exhaustively test all i's, but has worked for the short period I ran it. Seems to work on all 32-bit states of the dividend (j=0..2^32-1) for the i's I tested.
In reality, you would precompute a lookup table for i=2..64k-1 or some range like that (i=0 wont work because value/0 is undefined and i=1 wont work because the magic multiplier for that is just outside the range of a 32-bit number). Then use the equation using i as your lookup index to get the magic multiplier 'm'. Change as needed & don't hate on my style. :P
#include <stdio.h>
int main() {
unsigned int i,j,k,m,c;
// compute j/i,
// compute k = 2^32/i
// instead of j/i, use m = ~(j*k)>>32
srand(time(0));
for(c=0;c<64;c++) {
// generate random divisor i's for testing, then fully test every j
i = rand()&0x7fff;
// precompute these and put into a lookup table, index on [i]
k = (((__int64)1)<<32)/i;
for(j=0;j!=-1;j++) {
// status updater so we know it's working...
if(!(j&0xfffff)) { printf("%d : %d \r", i, j); fflush(0); }
// multiply instead of divide!
m = (((__int64)j*k)+k/2)>>32;
// rare fixup
if(j - m*i >= i) m++;
if(m != j/i) {
// as long as this line doesn't print, we're ok
printf("wrong : %d %d %d got: %d should be: %d\n",
i, j, k, m, j/i);
}
}
}
}