Tags: algorithm, sorting, optimization, search, binary-search

Optimized on-disk data structure for search with minimal random accesses


I have a huge file (~16TB) that contains a list of 8-byte keys mapped to 8-byte values. (2^40 pairs, 16 bytes each).

I would now like to optimize the file so that I can search it efficiently. I have sorted the file and currently perform a binary search on it. This completes in 30 reads, but those reads are widely scattered across the file, especially during the first steps.

I am aware that I could load the entire binary-search partition that remains after 10 steps (16 GB) into memory and continue the search there. However, I only have a negligible amount of memory available, so this isn't an option.

Is there a way to arrange the data on disk so that the accesses required to search the file are close together from the start? That would allow me to load the entire "range" of values that needs to be read into memory at once, reducing both the total number of read calls and the number of random accesses.

After initial construction, the file never changes, so insertion and deletion are not relevant, and building any kind of index is allowed to take a long time. Also, the keys are (roughly) uniformly distributed across the 2^64 space.


Solution

  • The best solution to your problem is not a different arrangement of the data on the disk, but an index that can be used to look up the address of a key on the disk. Maintaining such an index reduces the number of read operations from 30 to 2. Bam.

    A 32 bit index is well suited to your situation. Imagine we split the 64 bit space that your keys span into 2^32 buckets. Each of these buckets could then hold 2^32 keys, since 2^32 * 2^32 = 2^64. However, since you only have 2^40 items in total, (hopefully) uniformly distributed in the 2^64 space, each bucket only needs to hold 2^8 items on average, since 2^40 / 2^32 = 2^8.

    2^8 is just 256 items, times 16 bytes per item, which yields 4 kilobytes of data per bucket: something you can fetch with one read operation from the disk and then search in RAM for the correct item. Surely you have 4KB of memory to spare? Otherwise you must have something like a pocket calculator with a petabyte-sized disk attached. If you really have no memory beyond the minimum of 16 bytes needed to store a single item, you would still have to binary search those 256 items, but you would at least be down to 9 disk reads instead of 30 (one for the index lookup and 8 for the binary search, since log2(256) = 8), and those reads would even be relatively close to each other.
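    To make the sizing arithmetic concrete, here it is as a small C sketch (the constant names are my own, purely for illustration):

    #include <stdint.h>

    #define NUM_PAIRS   (1ULL << 40)   /* 2^40 key/value pairs                */
    #define NUM_BUCKETS (1ULL << 32)   /* 2^32 buckets = top 32 bits of a key */
    #define PAIR_SIZE   16ULL          /* 8-byte key + 8-byte value           */

    /* Average pairs per bucket: NUM_PAIRS / NUM_BUCKETS = 2^8 = 256.    */
    /* Average bucket size: 256 * PAIR_SIZE = 4096 bytes = one 4KB read. */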

    So, the procedure is as follows.

    First, create a hash(key) (read: hash of key) function like so:

    hash(key) = take the 32 most significant bits of key
    

    You can accomplish this with a single bit shift (taking the most significant bits is a shift, not a mask). Lightning fast. You will find that a bunch of your keys map to the same hash value, but that's okay. Each hash value indexes a small record (disk_address, items) that stores 1. the address of the bucket on disk, and 2. the number of items in that bucket (which could be more or less than 256).
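    As a minimal C sketch of that function (the name hash is simply the one used above):

    #include <stdint.h>

    /* Bucket number = the 32 most significant bits of the key. */
    static uint32_t hash(uint64_t key)
    {
        return (uint32_t)(key >> 32);
    }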

    Then, sort your data by key as you already have, and write it to the disk sequentially, back to back, building the index as you go by remembering the address at which every bucket starts on the disk and the number of data items that land in it:

    i = 0
    while (still have data left)
    {
        disk_address = writeData(data[i])
        if (index[hash(key[i])] == (0, 0))
            index[hash(key[i])] = (disk_address, 1);   // first item: remember where this bucket starts
        else
            index[hash(key[i])].items = index[hash(key[i])].items + 1;   // keep the start address, only bump the count
        i++
    }
    

    The size of the index would be 2^32 * (64 bits for the address + 16 bits for the number of items) = 40 GB, which is tiny (about 0.25%) compared to the 16 TB of data, and the index can itself be stored on disk.
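    For concreteness, here is one way that build pass could look in C. This is a sketch under my own assumptions (raw FILE* handles, a packed 10-byte entry of 8-byte address plus 2-byte count, host byte order); it also exploits the fact that the input is sorted, so bucket numbers only ever increase and the 40 GB index can be streamed straight to disk instead of held in RAM:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define ENTRY_SIZE 10   /* 8-byte bucket address + 2-byte item count */

    /* Append one packed index entry (assumes host byte order on disk). */
    static void write_entry(FILE *idx, uint64_t address, uint16_t count)
    {
        uint8_t entry[ENTRY_SIZE];
        memcpy(entry, &address, 8);
        memcpy(entry + 8, &count, 2);
        fwrite(entry, 1, ENTRY_SIZE, idx);
    }

    /* Stream the sorted pairs once and emit all 2^32 index entries in order. */
    static void build_index(FILE *data, FILE *idx)
    {
        uint64_t pair[2];            /* pair[0] = key, pair[1] = value */
        uint64_t address = 0;        /* byte offset of the current pair in the data file */
        uint64_t bucket_start = 0;   /* byte offset where the current bucket began */
        int64_t  current = -1;       /* bucket currently being counted (-1 = none yet) */
        uint16_t count = 0;          /* items in the current bucket; with uniform keys
                                        this stays far below 2^16 */

        while (fread(pair, sizeof(uint64_t), 2, data) == 2) {
            int64_t b = (int64_t)(pair[0] >> 32);
            if (b != current) {
                if (current >= 0)
                    write_entry(idx, bucket_start, count);   /* close the previous bucket */
                for (int64_t skip = current + 1; skip < b; skip++)
                    write_entry(idx, 0, 0);                  /* empty buckets in between */
                current = b;
                bucket_start = address;
                count = 0;
            }
            count++;
            address += 16;
        }
        if (current >= 0)
            write_entry(idx, bucket_start, count);           /* close the last bucket */
        for (int64_t skip = current + 1; skip < ((int64_t)1 << 32); skip++)
            write_entry(idx, 0, 0);                          /* trailing empty buckets */
    }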

    Finally, to look up the value identified by a given key:

    (disk_address, items) = load_from_disk(index[hash(key)])   // disk read 1: the bucket's index entry
    data_items = disk_read(disk_address, items * 16 bytes)     // disk read 2: the whole bucket (~4KB)
    value = search_items_for_key(data_items, key)              // in-memory scan of ~256 items
    

    Of course this is all just pseudo code and I leave the implementation to you. :)
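    That said, if a concrete starting point helps, here is a rough C sketch of that lookup path. The file layout (a data file of sorted 16-byte pairs plus an index file of packed 10-byte entries, host byte order) and the function names are my assumptions, and it uses POSIX fseeko for 64-bit file offsets:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    #define PAIR_SIZE  16   /* 8-byte key + 8-byte value                 */
    #define ENTRY_SIZE 10   /* 8-byte bucket address + 2-byte item count */

    static uint32_t bucket_of(uint64_t key) { return (uint32_t)(key >> 32); }

    /* Returns 1 and fills *value if key is found, 0 otherwise. Exactly two disk reads. */
    static int lookup(FILE *index_file, FILE *data_file, uint64_t key, uint64_t *value)
    {
        uint8_t  entry[ENTRY_SIZE];
        uint64_t disk_address;
        uint16_t items;

        /* Disk read 1: this bucket's 10-byte index entry. */
        if (fseeko(index_file, (off_t)bucket_of(key) * ENTRY_SIZE, SEEK_SET) != 0 ||
            fread(entry, 1, ENTRY_SIZE, index_file) != ENTRY_SIZE)
            return 0;
        memcpy(&disk_address, entry, 8);
        memcpy(&items, entry + 8, 2);
        if (items == 0)
            return 0;   /* empty bucket: the key is not in the file */

        /* Disk read 2: the whole bucket (~256 pairs = ~4KB on average). */
        uint8_t *bucket = malloc((size_t)items * PAIR_SIZE);
        if (bucket == NULL)
            return 0;
        if (fseeko(data_file, (off_t)disk_address, SEEK_SET) != 0 ||
            fread(bucket, PAIR_SIZE, items, data_file) != items) {
            free(bucket);
            return 0;
        }

        /* Scan the bucket in memory (it is sorted, so a binary search works too). */
        int found = 0;
        for (uint16_t i = 0; i < items; i++) {
            uint64_t k;
            memcpy(&k, bucket + (size_t)i * PAIR_SIZE, 8);
            if (k == key) {
                memcpy(value, bucket + (size_t)i * PAIR_SIZE + 8, 8);
                found = 1;
                break;
            }
        }
        free(bucket);
        return found;
    }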