Search code examples
cundefined-behaviorboyer-moore

Undefined behaviour in a simple version of the Boyer Moore algorithm


I am trying to write the more simple version of the Boyer Moore algorithm without prefix-function. It has to print all positions of symbols which have been compared with a pattern. And locally it passes the tests, but it fails when I commit it to the gitlab. I can't spot undefined behaviour here. I have garbage at the end.

#include <stdio.h>

#define MAX_PATTERN_LEN 16
#define BUF_SIZE 69
#define ALPH_SIZE 128

int read_str(int max_len, unsigned char *str_place) {
    int count_read = 0;
    for (int i = 0, ch; i < max_len; i++) {
        if ((ch = getchar()) == '\n') {
            str_place[i] = '\0';
            break;
        }
        str_place[i] = (char)ch;
        count_read++;
    }
    return count_read;
}

void calculate_shifts(const unsigned char *str, int len_str, int *badchar) {
    for (int i = 0; i < ALPH_SIZE; i++)
        badchar[i] = len_str;
    for (int i = 0; i < len_str - 1; i++)
        badchar[str[i]] = len_str - 1 - i;
}

void search(const unsigned char *str, const unsigned char *patt, int len_patt, int len_str) {
    int badchar[ALPH_SIZE];
    calculate_shifts(patt, len_patt, badchar);
    int shift = 0;
    while (shift <= (len_str - len_patt)) {
        int j = len_patt - 1;
        for (; j >= 0 && patt[j] == str[shift + j]; j--)
            printf("%d ", shift + j + 1);
        if (j < 0) {
            shift += ((shift + len_patt) < len_str) ? badchar[patt[len_patt - 1]] : 1;
        } else {
            printf("%d ", shift + j + 1);
            int shift_addition = badchar[str[shift + j]];
            if ((shift_addition == len_patt) && (j < len_patt - 1) && (patt[len_patt - 1] == patt[0]))
                shift_addition--;
            shift += shift_addition;
        }
    }
}

int main(void) {
    unsigned char str[BUF_SIZE + 1];
    unsigned char patt[MAX_PATTERN_LEN + 1];
    int len_patt = read_str(MAX_PATTERN_LEN + 1, patt);
    int len_str = read_str(BUF_SIZE + 1, str);
    if (!len_patt || !len_str)
        return 0;
    search(str, patt, len_patt, len_str);
    return 0;
} 

the test:

example
this is simple example

the correct output:

7 14 13 12 11 10 20 22 21 20 19 18 17 16

the actual output:

7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 ..

Solution

  • The difference (extra number 28 at the end of the output) can be caused by a missing newline character (\n) at the end of the last line of the input. I was able to reproduce both of your outputs (with and without the 28) locally.

    $ gcc -g -O2 -W -Wall -o t t.c
    $ printf 'example\nthis is simple example; \n' | ./t; echo
    7 14 13 12 11 10 20 22 21 20 19 18 17 16
    $ printf 'example\nthis is simple example; ' | ./t; echo
    7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 
    

    TL;DR Fix the bug in your code by changing (ch = getchar()) == '\n' to (ch = getchar()) == EOF || ch == '\n', and changing #define ALPH_SIZE 128 to #define ALPH_SIZE 256.

    Maybe there is another bug, I haven't checked. If you encounter any other bugs, please ask it in a separate question on StackOverflow.


    The rest of this answer explains how I diagnosed this bug.

    AddressSanitizer gcc -fsanitize=address) reveals a memory access problem if the newline is missing:

    $ gcc -g -O2 -W -Wall -fsanitize=address -o t t.c
    $ printf 'example\nthis is simple example; \n' | ./t; echo
    7 14 13 12 11 10 20 22 21 20 19 18 17 16 
    $ printf 'example\nthis is simple example; ' | ./t; echo
    ASAN:DEADLYSIGNAL
    =================================================================
    ==19294==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffe25fbe264 (pc 0x563d6542b204 bp 0x7ffe25fbe264 sp 0x7ffe9a1bbc70 T0)
    ==19294==The signal is caused by a READ memory access.
        #0 0x563d6542b203 in search /tmp/t.c:33
        #1 0x563d6542acb1 in main /tmp/t.c:54
        #2 0x7f617ff96c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
        #3 0x563d6542ad99 in _start (/tmp/t+0xd99)
    
    AddressSanitizer can not provide additional info.
    SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in search
    ==19294==ABORTING
    

    Please note that I had to rerun it multiple times to trigger the AddressSanitizer error message.

    Most of the time this is caused by a bug in your program, and often it's undefined behavior. The in search /tmp/t.c:33 clause in the output of AddressSanitizer indicates where in your source code the bad read happens (i.e. line 33, function search).

    I've added some extra printfs to your code to see what happens in line 33. The output is:

    $ printf 'example\nthis is simple example; ' | ./t; echo
    patt[6] == 101
    str[6] == 115
    patt[6] == 101
    str[13] == 101
    patt[5] == 108
    str[12] == 108
    patt[4] == 112
    str[11] == 112
    patt[3] == 109
    str[10] == 109
    patt[2] == 97
    str[9] == 105
    patt[6] == 101
    str[19] == 112
    patt[6] == 101
    str[21] == 101
    patt[5] == 108
    str[20] == 108
    patt[4] == 112
    str[19] == 112
    patt[3] == 109
    str[18] == 109
    patt[2] == 97
    str[17] == 97
    patt[1] == 120
    str[16] == 120
    patt[0] == 101
    str[15] == 101
    patt[6] == 101
    str[27] == 255
    patt[6] == 101
    str[-369011759] == ??
    ASAN:DEADLYSIGNAL
    =================================================================
    ==19906==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffc6e5c0c71 (pc 0x5621e2363094 bp 0x0000ea0153d1 sp 0x7ffc845ab540 T0)
    ==19906==The signal is caused by a READ memory access.
        #0 0x5621e2363093 in gett /tmp/t.c:33
        #1 0x5621e236349b in search /tmp/t.c:42
        #2 0x5621e2362e61 in main /tmp/t.c:63
        #3 0x7f7063ee8c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
        #4 0x5621e2362f49 in _start (/tmp/t+0xf49)
    
    AddressSanitizer can not provide additional info.
    SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in gett
    ==19906==ABORTING
    

    The culprit is the last read access: str[-369011759] == ??, this indicates that shift + j becomes -369011759, which is obviously incorrect, because correct values are small nonnegative integers.

    Please note that I had to rerun it multiple times to trigger the Address Sanitizer error message, and the number -369011759 was different each time.

    It looks like j has a sane value of 6. To figure out the actual bug, the next step is checking where shift gets its large negative value.

    At this point the output line str[27] == 255 was suspicious to me. I checked your code whether it handles such large values in str correctly, and then I found a bug.

    One way to fix this newfound bug is changing ALPH_SIZE from 128 to 256. (I've verified that it makes the Address Sanitizer errors disappear.) Here is why. Line 39 contains the read of badchar[str[shift + j]]. For that to be valid, we need 0 <= str[shift + j] && str[shift + j] < ALPH_SIZE, because ALPH_SIZE is the size of the badchar array. However, the maximum value for str[...] is 255 (because it's an unsigned char), thus we need ALPH_SIZE >= 256.

    Indeed, an out-of-bounds array access (read or write) is undefined behavior in C. To prove that this is actually happening, I've changed line 39 to

            printf("%d ", shift + j + 1); fprintf(stderr, "RBC %d == %d\n", shift + j, str[shift + j]);
    

    , and I've also changed line 28 to use malloc:

        int *badchar = malloc(sizeof(int) * ALPH_SIZE);
    

    , and I've also added free(badchar) later. With all that I've got:

    $ gcc -g -O2 -W -Wall -include stdlib.h -fsanitize=address -o t t.c
    $ printf 'example\nthis is simple example; \n' | ./t; echo
    RBC 6 == 115
    RBC 9 == 105
    RBC 19 == 112
    7 14 13 12 11 10 20 22 21 20 19 18 17 16 
    $ printf 'example\nthis is simple example; ' | ./t; echo
    RBC 6 == 115
    RBC 9 == 105
    RBC 19 == 112
    RBC 27 == 255
    =================================================================
    ==23463==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61500000047c at pc 0x563915b3754a bp 0x7fff5f340710 sp 0x7fff5f340700
    READ of size 4 at 0x61500000047c thread T0
        #0 0x563915b37549 in search /tmp/t.c:39
        #1 0x563915b36dc1 in main /tmp/t.c:54
        #2 0x7f1a63d6dc86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
        #3 0x563915b36ea9 in _start (/tmp/t+0xea9)
    
    Address 0x61500000047c is a wild pointer.
    SUMMARY: AddressSanitizer: heap-buffer-overflow /tmp/t.c:39 in search
    Shadow bytes around the buggy address:
      0x0c2a7fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x0c2a7fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x0c2a7fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
      0x0c2a7fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
      0x0c2a7fff8070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    =>0x0c2a7fff8080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
      0x0c2a7fff8090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
      0x0c2a7fff80a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
      0x0c2a7fff80b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
      0x0c2a7fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
      0x0c2a7fff80d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    Shadow byte legend (one shadow byte represents 8 application bytes):
      Addressable:           00
      Partially addressable: 01 02 03 04 05 06 07 
      Heap left redzone:       fa
      Freed heap region:       fd
      Stack left redzone:      f1
      Stack mid redzone:       f2
      Stack right redzone:     f3
      Stack after return:      f5
      Stack use after scope:   f8
      Global redzone:          f9
      Global init order:       f6
      Poisoned by user:        f7
      Container overflow:      fc
      Array cookie:            ac
      Intra object redzone:    bb
      ASan internal:           fe
      Left alloca redzone:     ca
      Right alloca redzone:    cb
    ==23463==ABORTING
    

    Indeed, the output of AddressSanitizer indicates the bug in line 39, and the output of the extra fprintf calls confirm that it's an out-of-bounds array access: reading of element 255 of the array badchar, which has only 128 elements.

    How did we end up with str[27] == 255 (thus reading badchar[255])? That's easy: if there is no terminating \n at the end of the file when reading str, then getchar() returns EOF, which becomes 255 when assigned to an unsigned char. (The obvious fix, as above, is to stop at EOF instead, and change ALPH_SIZE to 256.)

    Why hasn't AddressSanitizer reported the read of out-of-bounds array index problem (as a stack-buffer-overflow) in line 39 before badchar was changed to use malloc (reported as heap-buffer-overflow)? I don't quite understand this yet. AddressSanitizer should be able to report stack-buffer-overflow reliably. Probably it depends on the GCC and Clang version, because changing the command gcc to clang made the stack-buffer-overflow error appear reliably. Probably upgrading to the latest GCC and Clang would help. Valgrind is not as reliable for detecting stack-buffer-overflow reads as AddressSanitizer.