I am trying to write the more simple version of the Boyer Moore algorithm without prefix-function. It has to print all positions of symbols which have been compared with a pattern. And locally it passes the tests, but it fails when I commit it to the gitlab. I can't spot undefined behaviour here. I have garbage at the end.
#include <stdio.h>
#define MAX_PATTERN_LEN 16
#define BUF_SIZE 69
#define ALPH_SIZE 128
int read_str(int max_len, unsigned char *str_place) {
int count_read = 0;
for (int i = 0, ch; i < max_len; i++) {
if ((ch = getchar()) == '\n') {
str_place[i] = '\0';
break;
}
str_place[i] = (char)ch;
count_read++;
}
return count_read;
}
void calculate_shifts(const unsigned char *str, int len_str, int *badchar) {
for (int i = 0; i < ALPH_SIZE; i++)
badchar[i] = len_str;
for (int i = 0; i < len_str - 1; i++)
badchar[str[i]] = len_str - 1 - i;
}
void search(const unsigned char *str, const unsigned char *patt, int len_patt, int len_str) {
int badchar[ALPH_SIZE];
calculate_shifts(patt, len_patt, badchar);
int shift = 0;
while (shift <= (len_str - len_patt)) {
int j = len_patt - 1;
for (; j >= 0 && patt[j] == str[shift + j]; j--)
printf("%d ", shift + j + 1);
if (j < 0) {
shift += ((shift + len_patt) < len_str) ? badchar[patt[len_patt - 1]] : 1;
} else {
printf("%d ", shift + j + 1);
int shift_addition = badchar[str[shift + j]];
if ((shift_addition == len_patt) && (j < len_patt - 1) && (patt[len_patt - 1] == patt[0]))
shift_addition--;
shift += shift_addition;
}
}
}
int main(void) {
unsigned char str[BUF_SIZE + 1];
unsigned char patt[MAX_PATTERN_LEN + 1];
int len_patt = read_str(MAX_PATTERN_LEN + 1, patt);
int len_str = read_str(BUF_SIZE + 1, str);
if (!len_patt || !len_str)
return 0;
search(str, patt, len_patt, len_str);
return 0;
}
the test:
example
this is simple example
the correct output:
7 14 13 12 11 10 20 22 21 20 19 18 17 16
the actual output:
7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 ..
The difference (extra number 28 at the end of the output) can be caused by a missing newline character (\n
) at the end of the last line of the input. I was able to reproduce both of your outputs (with and without the 28) locally.
$ gcc -g -O2 -W -Wall -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16
$ printf 'example\nthis is simple example; ' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16 28
TL;DR Fix the bug in your code by changing (ch = getchar()) == '\n'
to (ch = getchar()) == EOF || ch == '\n'
, and changing #define ALPH_SIZE 128
to #define ALPH_SIZE 256
.
Maybe there is another bug, I haven't checked. If you encounter any other bugs, please ask it in a separate question on StackOverflow.
The rest of this answer explains how I diagnosed this bug.
AddressSanitizer gcc -fsanitize=address
) reveals a memory access problem if the newline is missing:
$ gcc -g -O2 -W -Wall -fsanitize=address -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16
$ printf 'example\nthis is simple example; ' | ./t; echo
ASAN:DEADLYSIGNAL
=================================================================
==19294==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffe25fbe264 (pc 0x563d6542b204 bp 0x7ffe25fbe264 sp 0x7ffe9a1bbc70 T0)
==19294==The signal is caused by a READ memory access.
#0 0x563d6542b203 in search /tmp/t.c:33
#1 0x563d6542acb1 in main /tmp/t.c:54
#2 0x7f617ff96c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
#3 0x563d6542ad99 in _start (/tmp/t+0xd99)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in search
==19294==ABORTING
Please note that I had to rerun it multiple times to trigger the AddressSanitizer error message.
Most of the time this is caused by a bug in your program, and often it's undefined behavior. The in search /tmp/t.c:33
clause in the output of AddressSanitizer indicates where in your source code the bad read happens (i.e. line 33, function search
).
I've added some extra printf
s to your code to see what happens in line 33. The output is:
$ printf 'example\nthis is simple example; ' | ./t; echo
patt[6] == 101
str[6] == 115
patt[6] == 101
str[13] == 101
patt[5] == 108
str[12] == 108
patt[4] == 112
str[11] == 112
patt[3] == 109
str[10] == 109
patt[2] == 97
str[9] == 105
patt[6] == 101
str[19] == 112
patt[6] == 101
str[21] == 101
patt[5] == 108
str[20] == 108
patt[4] == 112
str[19] == 112
patt[3] == 109
str[18] == 109
patt[2] == 97
str[17] == 97
patt[1] == 120
str[16] == 120
patt[0] == 101
str[15] == 101
patt[6] == 101
str[27] == 255
patt[6] == 101
str[-369011759] == ??
ASAN:DEADLYSIGNAL
=================================================================
==19906==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffc6e5c0c71 (pc 0x5621e2363094 bp 0x0000ea0153d1 sp 0x7ffc845ab540 T0)
==19906==The signal is caused by a READ memory access.
#0 0x5621e2363093 in gett /tmp/t.c:33
#1 0x5621e236349b in search /tmp/t.c:42
#2 0x5621e2362e61 in main /tmp/t.c:63
#3 0x7f7063ee8c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
#4 0x5621e2362f49 in _start (/tmp/t+0xf49)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in gett
==19906==ABORTING
The culprit is the last read access: str[-369011759] == ??
, this indicates that shift + j
becomes -369011759, which is obviously incorrect, because correct values are small nonnegative integers.
Please note that I had to rerun it multiple times to trigger the Address Sanitizer error message, and the number -369011759 was different each time.
It looks like j
has a sane value of 6. To figure out the actual bug, the next step is checking where shift
gets its large negative value.
At this point the output line str[27] == 255
was suspicious to me. I checked your code whether it handles such large values in str
correctly, and then I found a bug.
One way to fix this newfound bug is changing ALPH_SIZE
from 128
to 256
. (I've verified that it makes the Address Sanitizer errors disappear.) Here is why. Line 39 contains the read of badchar[str[shift + j]]
. For that to be valid, we need 0 <= str[shift + j] && str[shift + j] < ALPH_SIZE
, because ALPH_SIZE
is the size of the badchar
array. However, the maximum value for str[...]
is 255 (because it's an unsigned char), thus we need ALPH_SIZE >= 256
.
Indeed, an out-of-bounds array access (read or write) is undefined behavior in C. To prove that this is actually happening, I've changed line 39 to
printf("%d ", shift + j + 1); fprintf(stderr, "RBC %d == %d\n", shift + j, str[shift + j]);
, and I've also changed line 28 to use malloc:
int *badchar = malloc(sizeof(int) * ALPH_SIZE);
, and I've also added free(badchar)
later. With all that I've got:
$ gcc -g -O2 -W -Wall -include stdlib.h -fsanitize=address -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
RBC 6 == 115
RBC 9 == 105
RBC 19 == 112
7 14 13 12 11 10 20 22 21 20 19 18 17 16
$ printf 'example\nthis is simple example; ' | ./t; echo
RBC 6 == 115
RBC 9 == 105
RBC 19 == 112
RBC 27 == 255
=================================================================
==23463==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61500000047c at pc 0x563915b3754a bp 0x7fff5f340710 sp 0x7fff5f340700
READ of size 4 at 0x61500000047c thread T0
#0 0x563915b37549 in search /tmp/t.c:39
#1 0x563915b36dc1 in main /tmp/t.c:54
#2 0x7f1a63d6dc86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
#3 0x563915b36ea9 in _start (/tmp/t+0xea9)
Address 0x61500000047c is a wild pointer.
SUMMARY: AddressSanitizer: heap-buffer-overflow /tmp/t.c:39 in search
Shadow bytes around the buggy address:
0x0c2a7fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c2a7fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff8070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c2a7fff8080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
0x0c2a7fff8090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c2a7fff80d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==23463==ABORTING
Indeed, the output of AddressSanitizer indicates the bug in line 39, and the output of the extra fprintf
calls confirm that it's an out-of-bounds array access: reading of element 255 of the array badchar
, which has only 128 elements.
How did we end up with str[27] == 255
(thus reading badchar[255]
)? That's easy: if there is no terminating \n
at the end of the file when reading str
, then getchar()
returns EOF, which becomes 255 when assigned to an unsigned char
. (The obvious fix, as above, is to stop at EOF instead, and change ALPH_SIZE
to 256.)
Why hasn't AddressSanitizer reported the read of out-of-bounds array index problem (as a stack-buffer-overflow) in line 39 before badchar
was changed to use malloc
(reported as heap-buffer-overflow)? I don't quite understand this yet. AddressSanitizer should be able to report stack-buffer-overflow reliably. Probably it depends on the GCC and Clang version, because changing the command gcc
to clang
made the stack-buffer-overflow error appear reliably. Probably upgrading to the latest GCC and Clang would help. Valgrind is not as reliable for detecting stack-buffer-overflow reads as AddressSanitizer.