
Why don't x86-64 (or other architectures) implement division by 10?


I'm familiar with compilers replacing division and modulo by a constant (e.g. x / 10) with multiplication by a magic number, e.g. (x * 0xcccccccd) >> (32 + 3) for the unsigned case. Compiler output example:

        movl    $1717986919, %edx
        addl    $12, %esp
        movl    %ecx, %eax
        imull   %edx
        movl    %ecx, %eax
        sarl    $31, %eax
        sarl    $2, %edx
        subl    %eax, %edx
        leal    (%edx,%edx,4), %eax
        addl    %eax, %eax
        subl    %eax, %ecx
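
For the unsigned case the same trick is easy to write out in C; if I have it right, the magic constant 0xCCCCCCCD is 2**35 / 10 rounded up (the function names here are just illustrative):

```c
#include <stdint.h>

/* unsigned division by 10 via multiply-high: 0xCCCCCCCD is
   2**35 / 10 rounded up, so the top bits of the product hold x / 10 */
uint32_t udiv10 (uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* the remainder follows from the quotient with a multiply and subtract */
uint32_t umod10 (uint32_t x)
{
    return x - 10u * udiv10 (x);
}
```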

At the same time, I was a little bit surprised that there isn't some x86-64 instruction that does:

edx <= eax % 10
eax <= eax / 10

It seems like division by 10 is a fairly common operation for int-to-string conversion or anything that works in decimal bases. x86-64 is also known for providing somewhat-niche instructions, many of which combine two or more common operations (for instance, the fused multiply-add extensions). It seems like the silicon could potentially do slightly better than the compiler?

I'm hoping to find an answer with some documentation or history behind it (not just speculation), if one exists. Though I understand it may be hard to find explanations for why architecture developers haven't done something.


Solution

  • Ever since the mid-1990s, and thus several years before work started on x86-64, it has been generally known that integer division by a constant does not require an actual division operation but can instead be computed via integer multiplication:

    Torbjörn Granlund and Peter L. Montgomery, "Division by invariant integers using multiplication," in Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, June 1994, pp. 61-72.

    There are applications that use integer-to-ASCII conversion heavily, making it relevant to overall application performance. I seem to recall that early web browsers were among these applications, as were various programs used in commercial data processing. But such a conversion does not need many division-by-constant operations either: one can convert the integer into a binary fixed-point representation once and then pull out the decimal digits one by one by multiplying with ten.

    This technique of general base conversion is so old that I do not even have a reference for it; it was probably used manually prior to the advent of electronic computers. It is also how the FBSTP instruction of the x87 FPU works internally, or at least that is how I implemented it for the FPU of the Athlon processor. FBSTP is rarely used, however, and microcode ROM is a very expensive resource, so I optimized its microcode for space rather than speed.

    An implementation in ISO-C99 of this approach for the conversion of a 32-bit integer might look like the code below. The umul32_wide() function is a C-level equivalent of an x86 MUL instruction. To make this code run fast, one would want a fast integer multiplier. Providing fast general integer multiplication is important for performance in many contexts, and this was certainly known to the x86-64 design team.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    
    void umul32_wide (uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
    {
        uint64_t prod = (uint64_t)a * b;
        *hi = (uint32_t)(prod >> 32);
        *lo = (uint32_t)prod;
    }
    
    void cvt_uint32 (uint32_t x, char *out)
    {
        uint32_t hi, lo, num, digcnt;
        uint32_t dig;
    
        /* find first digit, in [0,4], separately */
        umul32_wide (x, 0x89705f41, &hi, &lo); // 2**61 / 10**9 rounded
        num = hi + (lo >> 31);
        dig = num >> 29;
        x = x - 1000000000 * dig;
        *out++ = dig | '0';
    
        /* convert number in [0, 999999999] to 4:28 fixed-point */
        umul32_wide (x, 0xabcc7712, &hi, &lo); // (2**28 / 10**8) * 2**30 rounded up
        num = (hi << 2) + (lo >> 30) + 1;
    
        /* pull out decimal digits one at a time */
        dig = (num >> 28) | '0';
        num = num & 0x0fffffff;
        digcnt = 8;
        do {
            *out++ = dig;
            num = num * 10;
            dig = (num >> 28) | '0';
            num = num & 0x0fffffff;
            digcnt--;
        } while (digcnt);
        *out = dig;
    }
    
    int main (void)
    {
        char res [11] = {0};
        char ref [11] = {0};
    
        printf ("Testing uint32 exhaustively\n");
        unsigned int x;
        x = 0;
        do {
            cvt_uint32 (x, res);
            sprintf (ref, "%010u", x);
            if (memcmp (res, ref, 10) != 0) {
                printf ("error: x=%010u  res=%s  ref=%s\n", x, res, ref);
                return EXIT_FAILURE;
            }
            x++;
        } while (x);
        return EXIT_SUCCESS;
    }
    

    One could ask: what about the multiplication by ten in the loop? It is not needed: one could completely unroll the loop, replace the multiply-by-10 with a multiply-by-5, and dynamically adjust the fixed-point representation by one bit per step. A high-performance multiply-by-5 capability already existed prior to x86-64 in the form of the LEA instruction.
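
    As a rough illustration of that idea (my sketch, not anything from an actual implementation): keep the loop, but multiply by 5 and let the binary point migrate left one bit per digit, so that (num * 10) >> 28 becomes (num * 5) >> 27, then >> 26, and so on. This is arithmetically identical to the multiply-by-10 loop above:

```c
#include <stdint.h>

/* convert x in [0, 999999999] to nine ASCII digits using multiply-by-5
   and a binary point that moves left one bit per extracted digit */
void digits9_mul5 (uint32_t x, char *out)
{
    uint64_t prod = (uint64_t)x * 0xabcc7712u;  /* (2**28 / 10**8) * 2**30, rounded up */
    uint32_t num = ((uint32_t)(prod >> 32) << 2) + ((uint32_t)prod >> 30) + 1; /* 4:28 */
    for (int shift = 28; shift > 19; shift--) {
        *out++ = (num >> shift) | '0';          /* integer part is the next digit */
        num = (num & ((1u << shift) - 1)) * 5;  /* same as *10 with the point shifted */
    }
    *out = '\0';
}
```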

    Another observation is that the code shown above is serial in nature, with limited ILP. That can easily be fixed by partitioning an integer into chunks (possibly recursively), using divisions by constants (10^10, 10^5) to create the chunks. Today even cleverer approaches are known, as has been pointed out in comments.
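
    The chunked idea can be sketched for 32-bit integers like this (my own illustration; the helper names and the constant 6871948, i.e. 2**36 / 10**4 rounded up, are assumptions, and the compiler is relied upon to strength-reduce the divisions by 100000 into multiplies):

```c
#include <stdint.h>

/* convert a 5-digit chunk in [0, 99999] via 4:28 fixed point;
   6871948 is 2**36 / 10**4 rounded up */
static void digits5 (uint32_t x, char *out)
{
    uint32_t num = (uint32_t)(((uint64_t)x * 6871948u) >> 8) + 1;
    for (int i = 0; i < 5; i++) {
        out[i] = (num >> 28) | '0';
        num = (num & 0x0fffffff) * 10;
    }
}

/* one division by 1e5 splits the number into two independent
   5-digit chunks, which exposes instruction-level parallelism */
void cvt_uint32_chunked (uint32_t x, char *out)   /* ten digits + NUL */
{
    uint32_t hi = x / 100000u;   /* compiler emits a multiply-high */
    uint32_t lo = x % 100000u;
    digits5 (hi, out);
    digits5 (lo, out + 5);
    out[10] = '\0';
}
```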