
How to search in large (up to 50GB) sorted binary file using Python?


The binary data file looks like this (string form of the binary content): '50.134|50.135|180.453|180.473|191.001|191.001...3000.3453', ~1B values in total.

Query: find offsets (indices) of the first value x1 >= 200.03 and the last value x2 <= 200.59.
Read: read the values between x1 and x2, ~1k values.

Ideally, the querying and reading shouldn't take longer than 200ms. The file cannot be held in memory; it has to stay on disk (or even AWS S3).

What I have come up with so far: the file is split into chunks (e.g. 5 MB). Each chunk's first and last values are stored in an index that is used to find the corresponding chunks for a query. Next, those chunks are read into memory and an in-memory search is performed.

I'd be glad to hear about how others would approach the problem.

Thanks for your help!


Solution

  • Here is an example (a sketch, not fully tested) of how you could build a partial index of entries in your binary file that allows you to access subranges efficiently without loading the whole file into memory, while maximizing sequential reads:

    import bisect

    recordSize = 16    # size in bytes of one record in the file
    chunkSize  = 1024  # records per chunk (groups of 1K records)
    chunkIndex = []    # value of the first record of each chunk

    A, B = 200, 300    # query bounds (use the same type your record conversion produces)

    with open("testFile", "rb") as binaryFile:

        # build the partial index (chunkIndex) - only done once
        binaryFile.seek(0, 2)
        fileSize = binaryFile.tell()
        for position in range(0, fileSize, chunkSize * recordSize):
            binaryFile.seek(position)
            record = binaryFile.read(recordSize)
            # use your own record/binary format conversion here
            chunkValue = int.from_bytes(record[:4], byteorder="little", signed=False)
            chunkIndex.append(chunkValue)

        # to access a range of records with values between A and B:
        # (the chunk whose first value is <= A may still contain the start of the range)
        firstChunk = max(0, bisect.bisect_right(chunkIndex, A) - 1)
        position   = firstChunk * chunkSize * recordSize
        binaryFile.seek(position)
        done = False
        while not done:
            records = binaryFile.read(recordSize * chunkSize)  # sequential read
            if not records:
                break  # end of file
            for i in range(0, len(records), recordSize):
                record = records[i:i + recordSize]
                # use your own record/binary format conversion here
                value = int.from_bytes(record[:4], byteorder="little", signed=False)
                if value < A:
                    continue
                if value > B:
                    done = True
                    break
                # Process record here ...
    

    You will need to play with the value of chunkSize to find a sweet spot that balances load time and memory usage against data access time. Since your ranges will not always fall on chunk boundaries, in the worst-case scenario you may end up reading records that you don't want and have to skip over them; on average you will read chunkSize/2 unnecessary records. This is where the difference in performance between sequential and random access can pay off.
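
    As a rough, back-of-the-envelope illustration of that trade-off (with hypothetical numbers: ~1 billion 16-byte records and an assumed 8 bytes per in-memory index key), you can see how index memory and wasted reads move in opposite directions as chunkSize grows:

    numRecords = 1_000_000_000   # assumed total record count
    keySize    = 8               # assumed bytes per in-memory index entry

    for chunkSize in (64, 1024, 16384):
        indexEntries  = numRecords // chunkSize
        indexMemoryMB = indexEntries * keySize / 1e6   # size of the in-memory chunk index
        wastedRecords = chunkSize / 2                  # average unnecessary records read per query
        print(chunkSize, round(indexMemoryMB), wastedRecords)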

    On a network drive, random access is impacted by latency, while sequential access is a function of bandwidth. In other words, more requests mean more round trips to the server (latency), and reading bigger chunks requires more packets (bandwidth).

    If you are using an HDD (or a network drive), sequential reads of multiple adjacent records will tend to be much faster (at least 20x) than random access, and you should get some benefit from this partial indexing.
    However, if your file is on an internal SSD, then a standard binary search directly in the file (without the in-memory index) will perform faster.
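
    For comparison, such a direct binary search over the file could look like the following minimal sketch (it assumes the same placeholder record layout as the example above: fixed 16-byte records with a little-endian unsigned key in the first 4 bytes):

    import os

    def find_first_geq(binaryFile, A, recordSize=16):
        # binary search directly on the file: byte offset of the first record with key >= A
        binaryFile.seek(0, os.SEEK_END)
        numRecords = binaryFile.tell() // recordSize
        lo, hi = 0, numRecords
        while lo < hi:
            mid = (lo + hi) // 2
            binaryFile.seek(mid * recordSize)
            record = binaryFile.read(recordSize)
            value = int.from_bytes(record[:4], byteorder="little", signed=False)
            if value < A:
                lo = mid + 1
            else:
                hi = mid
        return lo * recordSize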

    With 1 billion records, finding the first record's position with a straight binary search would require about 30 seek/read operations (2^30 > 1B). If you keep 16M entries in the chunk index, each chunk will correspond to 64 records (1B / 16M). With those 16 million keys in memory, you would save 24 of the 30 seek/read operations (2^24 = 16M), at the cost of reading 32 unnecessary records (on average) sequentially.

    You may also choose to implement a hybrid of the two approaches to minimize disk access (i.e. use the partial index to find the chunk range and then a binary search to pinpoint the exact position of the first record within the starting chunk). This would require only 6 seek/read operations (2^6 = 64) to pinpoint the first record in the 64-record range indicated by the in-memory partial index.
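
    A sketch of that hybrid lookup might look like this (again assuming the placeholder record layout above, with chunkIndex holding the first key of each 64-record chunk); the on-disk binary search only touches the small span that the in-memory index points to:

    import bisect

    def find_start_hybrid(binaryFile, chunkIndex, A, recordSize=16, chunkSize=64):
        # step 1: in-memory search - the chunk whose first key is <= A may hold the first record >= A
        firstChunk = max(0, bisect.bisect_right(chunkIndex, A) - 1)
        binaryFile.seek(0, 2)
        numRecords = binaryFile.tell() // recordSize
        lo = firstChunk * chunkSize
        hi = min(lo + chunkSize, numRecords)
        # step 2: binary search on disk within that one chunk (~log2(64) = 6 seek/reads)
        while lo < hi:
            mid = (lo + hi) // 2
            binaryFile.seek(mid * recordSize)
            record = binaryFile.read(recordSize)
            value = int.from_bytes(record[:4], byteorder="little", signed=False)
            if value < A:
                lo = mid + 1
            else:
                hi = mid
        return lo * recordSize   # byte offset where the sequential range read should start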

    In both approaches, once you've found the first record, the rest of the range is read sequentially from there until you reach the end of the range or the end of the file. If you expect to read the same records more than once, you can optimize further by keeping a cache of the record ranges you've read before and using it to serve the data without going back to disk (e.g. by skipping over reads of records that are already in the cache during the sequential scan).
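
    One simple (hypothetical) way to do that is a per-chunk cache keyed by chunk number, so that repeated or overlapping queries skip the disk on a hit:

    chunkCache = {}  # chunk number -> raw bytes of that chunk (simple, unbounded cache)

    def read_chunk(binaryFile, chunkNum, recordSize=16, chunkSize=64):
        # return the raw bytes of one chunk, touching the disk only on a cache miss
        if chunkNum not in chunkCache:
            binaryFile.seek(chunkNum * chunkSize * recordSize)
            chunkCache[chunkNum] = binaryFile.read(chunkSize * recordSize)
        return chunkCache[chunkNum]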