While experimenting with different ways to code the leaves of a Schoenhage radix conversion tree I stumbled over the problem that the compilers (GCC, clang) optimize division by a small constant with multiplication with the reciprocal. As they should, no complains. So I decided to add a little inline-assembly to get comparable benchmarks but what I got instead were segfaults.
The code (not the most minimal example, but some context might help)
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#define ARRAY_LENGTH 33
static const char digits[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8',
'9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q',
'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '+',
'/' };
static void to_radix_recursive(unsigned int a, unsigned int b,
char *result, int *index) {
unsigned int r = 0u, q = 0u;
int i;
char c;
if (a == 0u) {
return;
}
#ifdef CZ_USE_ASM
//printf("BEFORE a = %u, b = %u, q = %u, r = %u\n", a, b, q, r);
__asm__("xorl %%edx, %%edx;"
"movl %2, %%eax;"
"movl %3, %%ebx;"
"divl %%ebx;"
: "=a"(q), "=d"(r)
: "g"(a), "g"(b)
);
//printf("AFTER a = %u, b = %u, q = %u, r = %u\n\n", a, b, q, r);
#else
q = a / b;
r = a % b;
#endif
to_radix_recursive(q, b, result, index);
c = digits[r];
i = *index; /* Line 41 */
result[i] = c;
(*index)++;
}
int main(void) {
int idx;
unsigned int a, b;
char result[ARRAY_LENGTH] = {'\0'};
/* All checks and balances ommitted! */
/* 0 < a <= UINT_MAX */
a = 1234567;
#ifdef CZ_USE_CONSTANT
/* Most compilers optimimize to multiplication with reciprocal here */
b = 10;
#else
/* Should press the optimizer to use "divl" */
for (b = 2u; b < 64u; b++) {
idx = 0;
to_radix_recursive(a, b, result, &idx);
printf("Result recursive = %s for radix %u\n", result,b);
for (int i = 0; i < ARRAY_LENGTH; i++) {
result[i] = '\0';
}
}
#endif
exit(EXIT_SUCCESS);
}
Expected output with -DCZ_USE_ASM
Result recursive = 100101101011010000111 for radix 2
Result recursive = 2022201111201 for radix 3
Result recursive = 10231122013 for radix 4
Result recursive = 304001232 for radix 5
Result recursive = 42243331 for radix 6
Result recursive = 13331215 for radix 7
Result recursive = 4553207 for radix 8
Result recursive = 2281451 for radix 9
Result recursive = 1234567 for radix 10
Result recursive = 773604 for radix 11
Result recursive = 4B6547 for radix 12
Result recursive = 342C19 for radix 13
Result recursive = 241CB5 for radix 14
Result recursive = 195BE7 for radix 15
Result recursive = 12D687 for radix 16
Result recursive = ED4EA for radix 17
Result recursive = BDC71 for radix 18
Result recursive = 98IG4 for radix 19
Result recursive = 7E687 for radix 20
Result recursive = 6769J for radix 21
Result recursive = 55KGF for radix 22
Result recursive = 49AHJ for radix 23
Result recursive = 3H787 for radix 24
Result recursive = 3407H for radix 25
Result recursive = 2I679 for radix 26
Result recursive = 28JDJ for radix 27
Result recursive = 206JJ for radix 28
Result recursive = 1LHS8 for radix 29
Result recursive = 1FLM7 for radix 30
Result recursive = 1ADKN for radix 31
Result recursive = 15LK7 for radix 32
Result recursive = 11BM4 for radix 33
Result recursive = VDWR for radix 34
Result recursive = SRSC for radix 35
Result recursive = QGLJ for radix 36
Result recursive = ODTP for radix 37
Result recursive = MIaN for radix 38
Result recursive = KVQM for radix 39
Result recursive = JBO7 for radix 40
Result recursive = HbHG for radix 41
Result recursive = GRaJ for radix 42
Result recursive = FMTb for radix 43
Result recursive = ELUF for radix 44
Result recursive = DOTb for radix 45
Result recursive = CVKJ for radix 46
Result recursive = BffI for radix 47
Result recursive = B7e7 for radix 48
Result recursive = AO9C for radix 49
Result recursive = 9hfH for radix 50
Result recursive = 9FXA for radix 51
Result recursive = 8eTZ for radix 52
Result recursive = 8FQc for radix 53
Result recursive = 7jKJ for radix 54
Result recursive = 7N6b for radix 55
Result recursive = 71bl for radix 56
Result recursive = 6bu4 for radix 57
Result recursive = 6Ivb for radix 58
Result recursive = 60cp for radix 59
Result recursive = 5gu7 for radix 60
Result recursive = 5Qln for radix 61
Result recursive = 5BAN for radix 62
Result recursive = 4x3J for radix 63
But as mentioned above: it segfaults instead. I pulled the code a bit apart to get one operation per line and run valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes ./divmod
which printed
==9546== Memcheck, a memory error detector
==9546== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9546== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==9546== Command: ./divmod
==9546==
==9546== Invalid read of size 4
==9546== at 0x4005FA: to_radix_recursive (divmod.c:41)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== Address 0x2 is not stack'd, malloc'd or (recently) free'd
==9546==
==9546==
==9546== Process terminating with default action of signal 11 (SIGSEGV)
==9546== Access not within mapped region at address 0x2
==9546== at 0x4005FA: to_radix_recursive (divmod.c:41)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== by 0x4005F1: to_radix_recursive (divmod.c:39)
==9546== If you believe this happened as a result of a stack
==9546== overflow in your program's main thread (unlikely but
==9546== possible), you can try to increase the size of the
==9546== main thread stack using the --main-stacksize= flag.
==9546== The main thread stack size used in this run was 8388608.
==9546==
==9546== HEAP SUMMARY:
==9546== in use at exit: 0 bytes in 0 blocks
==9546== total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==9546==
==9546== All heap blocks were freed -- no leaks are possible
Address 0x2
is very low and a hint for a failed pointer dereferencing which it is indeed.
The two calls to printf
are left for the reason you might have already guessed: if you use one (I needed to use both yesterday) it works. Which is almost always caused by some UB (undefined behaviour) somewhere.
With the faint fear that I made a really silly error and will get ridiculed by all of you: what is the cause of it and how can it be repaired?
The issue is that in your inline assembly you do:
__asm__("xorl %%edx, %%edx;"
"movl %2, %%eax;"
"movl %3, %%ebx;"
"divl %%ebx;"
: "=a"(q), "=d"(r)
: "g"(a), "g"(b)
);
GCC/CLANG are very unforgiving. If you modify a register you need to tell the compiler that it will be modified. In this inline assembly code you have said that EAX and EDX are output only registers (they will be modified) but you haven't told the compiler you modified/clobbered EBX. A simple fix for this is to add EBX to the clobber list like this:
__asm__("xorl %%edx, %%edx;"
"movl %2, %%eax;"
"movl %3, %%ebx;"
"divl %%ebx;"
: "=a"(q), "=d"(r)
: "g"(a), "g"(b)
: "ebx"
);
Now the compiler will not make any assumption about EBX still containing the same value as it did before the inline assembly code was run.
If your inline assembly starts with MOV
instructions you are probably taking the wrong approach by not using the inline assembly operands (and constraints) themselves to allow the compiler to try and generate the most efficient version of the code. Your inline assembly could have looked like this:
__asm__("divl %4"
: "=a"(q), "=d"(r)
: "a"(a), "d"(0), "r"(b)
);
We create a 5th operand to pass the divisor in a compiler chosen register. We also set EDX to zero in the operand rather than do it in the inline assembly. This version also reuses the EAX and EDX registers for both the input and output operands requiring fewer registers to be potentially used.