
strstr on huge mmapped file


I open a huge (11 GB) file, mmap it into memory, and fail to find a string in it.

My code is:

 if ( (fd  = open("l", O_RDONLY)) < 0 )     err_sys("Cant open file");
 if ( fstat(fd, &statbuf) < 0 )             err_sys("Cant get file size");
 printf("size is %ld\n", statbuf.st_size);

 if ( (src = mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, fd, 0)) == MAP_FAILED )  err_sys("Cant mmap");
 printf("src pointer is at %ld\n", src);

 char * index = strstr(src, "bin/bash");
 printf("needle is at %ld\n", index);

It works on small files, but on huge files strstr() returns 0 (NULL). What function should I use to search in huge mmapped files?

The output is:

size is 11111745740
src pointer is at 140357526544384
needle is at 0

Solution

  • You should not use strstr() to search for text in a memory mapped file:

    • If the file is binary, it most likely contains null bytes that will stop the search too soon. This is probably what you observe.
    • If the file is pure text, but does not contain a match, strstr will keep scanning beyond the end of the file, invoking undefined behavior by attempting to read unmapped memory.
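
The first failure mode can be reproduced on a small in-memory buffer. The sketch below is illustrative only: the helper names `offset_with_strstr` and `offset_with_memmem` are made up for this example, and it assumes a platform that provides memmem() (glibc's manual asks for `_GNU_SOURCE` before `<string.h>`).

```c
#define _GNU_SOURCE          /* requested by glibc's manual for memmem() */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of needle as seen by strstr(), or -1.
   strstr() treats buf as a C string, so it stops at the first '\0'. */
long offset_with_strstr(const char *buf, const char *needle)
{
    assert(buf != NULL && needle != NULL);
    const char *hit = strstr(buf, needle);
    return hit ? (long)(hit - buf) : -1;
}

/* Hypothetical helper: offset of needle as seen by memmem(), or -1.
   memmem() scans exactly len bytes, embedded '\0' bytes included. */
long offset_with_memmem(const char *buf, size_t len, const char *needle)
{
    assert(buf != NULL && needle != NULL);
    const char *hit = memmem(buf, len, needle, strlen(needle));
    return hit ? (long)(hit - buf) : -1;
}
```

For the 17-byte buffer `"ELF\0\0#!/bin/bash\n"`, the strstr() helper returns -1 because the scan ends at the first null byte, while the memmem() helper, given the full length, returns 8.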

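    You could instead use memmem(), which has equivalent semantics but operates on raw memory with explicit lengths instead of C strings. It is available on Linux and BSD systems (with glibc, define _GNU_SOURCE before including <string.h> to get its declaration):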

    void *memmem(const void *p1, size_t size1, const void *p2, size_t size2);
    

    Note also that your printf formats are wrong: use %p for src and index, and print the offset either as a ptrdiff_t (%td) or cast to unsigned long long (%llu):

        if ((fd  = open("l", O_RDONLY)) < 0)
            err_sys("Cannot open file");
        if (fstat(fd, &statbuf) < 0)
            err_sys("Cannot get file size");
    
        printf("size is %llu\n", (unsigned long long)statbuf.st_size);
    
        if ((src = mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, fd, 0)) == MAP_FAILED)
            err_sys("Cannot mmap");
    
        printf("src pointer is at %p\n", (void*)src);
    
        char *index = memmem(src, statbuf.st_size, "bin/bash", strlen("bin/bash"));
        printf("needle is at %p\n", (void*)index);
    
        if (index != NULL)
            printf("needle is at offset %llu\n", (unsigned long long)(index - (char *)src));
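
For reference, the fragment above can be wrapped into a self-contained function. This is a minimal sketch, not a definitive implementation: the name `find_in_file` and its return convention (-1 for "not found", -2 for an I/O error) are assumptions made for this example, and it assumes a 64-bit system where the whole file fits in the address space.

```c
#define _GNU_SOURCE          /* requested by glibc's manual for memmem() */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: byte offset of the first occurrence of needle
   in the file at path, -1 if it is absent, -2 on an I/O error. */
long long find_in_file(const char *path, const char *needle)
{
    assert(path != NULL && needle != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -2;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -2;
    }
    if (st.st_size == 0) {       /* mmap() rejects a zero-length mapping */
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    char *src = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping stays valid after close() */
    if (src == MAP_FAILED)
        return -2;

    /* Bounded search: never reads past src + len, ignores '\0' bytes. */
    const char *hit = memmem(src, len, needle, strlen(needle));
    long long offset = hit ? (long long)(hit - src) : -1;

    munmap(src, len);
    return offset;
}
```

A caller would print the result with `%lld`, for example `printf("%lld\n", find_in_file("l", "bin/bash"));`.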
    

    If memmem is not available on your platform, here is a simple implementation:

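    #include <string.h>
    
    /* Find the first occurrence of needle (n2 bytes) in haystack (n1 bytes). */
    void *memmem(const void *haystack, size_t n1, const void *needle, size_t n2) {
        const unsigned char *p1 = haystack;
        const unsigned char *p2 = needle;
    
        if (n2 == 0)
            return (void*)p1;   /* an empty needle matches at the start */
        if (n2 > n1)
            return NULL;        /* the needle cannot fit */
    
        /* p3 is one past the last position where a match can start */
        const unsigned char *p3 = p1 + n1 - n2 + 1;
        /* memchr() jumps to each candidate first byte; memcmp() checks the rest */
        for (const unsigned char *p = p1; (p = memchr(p, *p2, p3 - p)) != NULL; p++) {
            if (!memcmp(p, p2, n2))
                return (void*)p;
        }
    
        return NULL;
    }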