Search code examples
assemblydivisionmicro-optimizationinteger-division6502

Fast signed 16-bit divide by 7 for 6502


I am working on an assembly language program for a 6502 cpu, and am finding that I need a fast-as-possible divide-by-seven routine, in particular one which could take a 16-bit dividend.

I am familiar with the routines found here, but generalizing the divide-by-seven routine found there is quite complicated, and a cursory examination of the general algorithm (using integer division)

x/7 ~= (x + x/8 + x/64 ... )/8

indicates that to handle a 16-bit range, it would likely take over 100 cycles to complete because of the 6502's single accumulator register and the relative slowness of individual memory bit shifts on the 6502.

I thought that a lookup table might help, but on the 6502, I am of course restricted to lookup tables that are 256 bytes or fewer. To that end one could assume the existence of two 256-byte lookup tables, xdiv7 and xmod7, which, when using an unsigned one-byte value as an index into the table, can quickly obtain the result of a byte divided by 7 or modulo 7, respectively. I am not sure how I could utilize these to find the values for the full 16-bit range, however.

In parallel, I also need a modulo 7 algorithm, although ideally whatever solution that can be worked out with the division will produce a mod7 result as well. If additional precomputed tables are needed, I am amenable to adding those, as long as the total memory requirements for all tables does not exceed about 3k.

Although I ultimately need a signed division algorithm, an unsigned one would suffice, because I can generalize that to a signed routine as needed.

Any assistance would be greatly appreciated.


Solution

  • Note: As @Damien_The_Unbeliever pointed out in the comments, the upperHigh and lowerLow tables are identical. So they can be combined into a single table. However, that optimization will make the code harder to read, and the explanation harder to write, so combining the tables is left as an exercise for the reader.


    The code below shows how to generate the quotient and remainder when dividing a 16-bit unsigned value by 7. The simplest way to explain the code (IMO) is with an example, so let's consider dividing 0xa732 by 7. The expected result is:

    quotient = 0x17e2
    remainder = 4  
    

    We start by considering the input as two 8-bit values, the upper byte and the lower byte. The upper byte is 0xa7 and the lower byte is 0x32.

    We compute a quotient and remainder from the upper byte:

    0xa700 / 7 = 0x17db
    0xa700 % 7 = 3 
    

    So we need three tables:

    • upperHigh stores the high byte of the quotient: upperHigh[0xa7] = 0x17
    • upperLow stores the low byte of the quotient: upperLow[0xa7] = 0xdb
    • upperRem stores the remainder: upperRem[0xa7] = 3

    And we compute the quotient and remainder from the lower byte:

    0x32 / 7 = 0x07
    0x32 % 7 = 1
    

    So we need two tables:

    • lowerLow stores the low byte of the quotient: lowerLow[0x32] = 0x07
    • lowerRem stores the remainder: lowerRem[0x32] = 1

    Now we need to assemble the final answer. The remainder is the sum of two remainders. Since each remainder is in the range [0,6] the sum is in the range [0,12]. So we can use two 13 byte lookups to convert the sum into a final remainder and a carry.

    The low byte of the quotient is the sum of that carry and the values from the lowerLow and upperLow tables. Note that the sum may generate a carry into the high byte.

    The high byte of the quotient is the sum of that carry and the value from the upperHigh table.

    So, to complete the example:

    remainder = 1 + 3 = 4              // simple add (no carry in)
    lowResult = 0x07 + 0xdb = 0xe2     // add with carry from remainder
    highResult = 0x17                  // add with carry from lowResult
    

    The assembly code to implement this consists of 7 table lookups, an add-without-carry instruction and two add-with-carry instructions.


    #include <stdio.h>
    #include <stdint.h>
    
    uint8_t upperHigh[256];  // index:(upper 8 bits of the number)  value:(high 8 bits of the quotient)
    uint8_t upperLow[256];   // index:(upper 8 bits of the number)  value:(low  8 bits of the quotient)
    uint8_t upperRem[256];   // index:(upper 8 bits of the number)  value:(remainder when dividing the upper bits by 7)
    uint8_t lowerLow[256];   // index:(lower 8 bits of the number)  value:(low  8 bits of the quotient)
    uint8_t lowerRem[256];   // index:(lower 8 bits of the number)  value:(remainder when dividing the lower bits by 7)
    uint8_t carryRem[13]    = { 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 };
    uint8_t combinedRem[13] = { 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5 };
    
    void populateLookupTables(void)
    {
        for (uint16_t i = 0; i < 256; i++)
        {
            uint16_t upper = i << 8;
            upperHigh[i] = (upper / 7) >> 8;
            upperLow[i] = (upper / 7) & 0xff;
            upperRem[i] = upper % 7;
    
            uint16_t lower = i;
            lowerLow[i] = lower / 7;
            lowerRem[i] = lower % 7;
        }
    }
    
    void divideBy7(uint8_t upperValue, uint8_t lowerValue, uint8_t *highResult, uint8_t *lowResult, uint8_t *remainder)
    {
        uint8_t temp = upperRem[upperValue] + lowerRem[lowerValue];
        *remainder = combinedRem[temp];
        *lowResult = upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp];
        uint8_t carry = (upperLow[upperValue] + lowerLow[lowerValue] + carryRem[temp]) >> 8;  // Note this is just the carry flag from the 'lowResult' calcaluation
        *highResult = upperHigh[upperValue] + carry;
    }
    
    int main(void)
    {
        populateLookupTables();
    
        uint16_t n = 0;
        while (1)
        {
            uint8_t upper = n >> 8;
            uint8_t lower = n & 0xff;
    
            uint16_t quotient1  = n / 7;
            uint16_t remainder1 = n % 7;
    
            uint8_t high, low, rem;
            divideBy7(upper, lower, &high, &low, &rem);
            uint16_t quotient2 = (high << 8) | low;
            uint16_t remainder2 = rem;
    
            printf("n=%u q1=%u r1=%u q2=%u r2=%u", n, quotient1, remainder1, quotient2, remainder2);
            if (quotient1 != quotient2 || remainder1 != remainder2)
                printf(" **** failed ****");
            printf("\n");
    
            n++;
            if (n == 0)
                break;
        }
    }