
Brute-force find in a FILE* in C


I have been trying to find a way to brute-force search for an int64_t in a file in C. I have written the following code.

int64_t readbyte = 0, totalreadbytes = 0;
int64_t totalfound = 0;
const int64_t magic = MAGIC_NUMBER;

char *buffer = (char *)malloc(BUFFER_SIZE);
int64_t *offsets = (int64_t *)malloc(sizeof(int64_t) * (1 << 24));
if (buffer == NULL || offsets == NULL)
{
    return -3;
}

while ((readbyte = fread(buffer, 1, BUFFER_SIZE, inptr)) > 0)
{
    // Check every position at which a full 8-byte value still fits.
    for (int64_t i = 0; i <= readbyte - (int64_t)sizeof(magic); i++)
    {
        if (memcmp(buffer + i, &magic, sizeof(magic)) == 0)
        {
            offsets[totalfound++] = totalreadbytes + i;
        }
    }

    // A final short read can't contain the value; stop here instead of
    // seeking back and re-reading the same tail bytes forever.
    if (readbyte < (int64_t)sizeof(magic))
        break;

    // Re-read only the last sizeof(magic) - 1 bytes so a value that
    // straddles two buffers is found exactly once (seeking back the
    // full 8 bytes would record the boundary match twice).
    totalreadbytes += readbyte - (int64_t)(sizeof(magic) - 1);
    fseek(inptr, -(long)(sizeof(magic) - 1), SEEK_CUR);
}

// Do something to those offsets found

free(offsets);
free(buffer);

I have been wondering if there is a better way to find that int64_t, because my goal is to search files as large as 60 GB, and there may be several hundred thousand occurrences in such a file.


Solution

  • Backing up and re-reading data is going to slow things down quite a bit.

    Building on @melpomene's comment, here's a very simple way to do it with mmap():

    uint64_t needle;    // set to the value being searched for
    
    struct stat sb;
    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );
    
    // Map the whole file read-only; pages are faulted in on demand.
    unsigned char *haystack = mmap( NULL, sb.st_size,
        PROT_READ, MAP_PRIVATE, fd, 0 );
    
    // The mapping remains valid after the descriptor is closed.
    close( fd );
    
    off_t bytesToSearch = sb.st_size - sizeof( needle );
    
    // <= so the last bytes get searched
    for ( off_t ii = 0; ii <= bytesToSearch; ii++ )
    {
        if ( 0 == memcmp( haystack + ii, &needle, sizeof( needle ) ) )
        {
             // found it!
        }
    }
    
    munmap( haystack, sb.st_size );
    

    Error checking and proper headers omitted for clarity.

    There are a lot of ways to improve the performance of that. This IO pattern is close to the worst possible use of mmap() with regard to performance: every byte in the file is read exactly once, and the mapping is then thrown away. Mapping a file isn't all that fast in the first place, and faulting a 60 GB file through the page cache puts pressure on the entire machine.
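
    If the mapping approach is kept anyway, one common mitigation (my addition, not something the answer above relies on) is to tell the kernel up front that the file will be scanned sequentially, so it can read ahead aggressively and drop pages behind the scan:

    // Advise sequential access on the mapping made above.
    // (Linux/BSD madvise(); POSIX spells it posix_madvise()
    // with POSIX_MADV_SEQUENTIAL.)
    madvise( haystack, sb.st_size, MADV_SEQUENTIAL );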

    It'd probably be a lot faster to just use open() and read() with direct IO in large page-sized chunks into page-aligned memory, especially if the file is a significant fraction of the system's RAM. That makes the code more complex, because the comparisons have to span buffers: it's almost certainly faster to carry a few bytes from one buffer to the next to cover the break between them than it is to back up and do a non-aligned read. A sketch of that carry-over approach is below.
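
    Here's a minimal sketch of that chunked approach (my illustration of the idea, not the answerer's code). CHUNK_SIZE is an assumed tuning knob, and the O_DIRECT / alignment details are left out for clarity; carrying the last sizeof(needle) - 1 bytes from one chunk to the next is what catches a value that straddles a chunk boundary:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    
    #define CHUNK_SIZE (1 << 20)    /* 1 MiB per read(); tune to taste */
    
    int search_file( const char *filename, uint64_t needle )
    {
        int fd = open( filename, O_RDONLY );
        if ( fd < 0 )
            return -1;
    
        /* Overlap carried between chunks so boundary matches are found. */
        size_t keep = sizeof( needle ) - 1;
        unsigned char *buf = malloc( keep + CHUNK_SIZE );
        if ( buf == NULL )
        {
            close( fd );
            return -1;
        }
    
        off_t base = 0;         /* file offset of buf[0] */
        size_t carried = 0;     /* bytes held over from the previous chunk */
        ssize_t got;
    
        while ( ( got = read( fd, buf + carried, CHUNK_SIZE ) ) > 0 )
        {
            size_t avail = carried + (size_t)got;
    
            /* Search every position where a full needle fits. */
            for ( size_t ii = 0; ii + sizeof( needle ) <= avail; ii++ )
            {
                if ( 0 == memcmp( buf + ii, &needle, sizeof( needle ) ) )
                    printf( "found at offset %lld\n",
                        (long long)( base + (off_t)ii ) );
            }
    
            /* Slide the last keep bytes to the front of the buffer so
               the next read() appends right after them. */
            if ( avail > keep )
            {
                memmove( buf, buf + avail - keep, keep );
                base += (off_t)( avail - keep );
                carried = keep;
            }
            else
            {
                carried = avail;
            }
        }
    
        free( buf );
        close( fd );
        return 0;
    }

    With 1 MiB chunks, the memmove() of 7 bytes per chunk is negligible, which is the point of the design: the file position only ever moves forward, and no byte is read from disk twice.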