Search code examples
data-structuresb-treered-black-treefile-mappinglarge-data

Red Black Tree versus B Tree


I have a project in which I have to achieve fast search, insert and delete operations on data ranging from megabytes to terabytes. I had been studying data structures of late and analyzing them. Being specific, I want to introduce 3 cases and ask questions on that:

  1. The data is much more than what the memory can handle (sample ranges in 10-15 terabytes) at one go. In this case, I would store the data structure on a disk.

  2. The data is relatively less compared to the memory of the system and thus it can be stored and operated in the memory itself for speed.

  3. The data is more than free memory and assume it is less than the size of a possible contiguous chunk of data in the paging file. thus I store the data structure in a file on disk and do a memory mapping of the file.

The conclusions I have drawn are:

For case 1, I should use a B-Tree for faster access as it saves on lag produced by disk rotation

For case 2, I should use a Red Black Tree for faster access as data is on memory and no. of elements needed to be scanned in worse case would be lesser than one i have to do if I use B Trees

For case 3, I am doubtful on this one, the page file is on disk uses native OS I/O to operate on files, so should B Tree be a better option or a Red Black tree?

I want to know where the above three conclusions go right and where it goes wrong and how I can improve on performance in the three separate cases.

I am using the C++ Language, with a red black tree and a B tree, both which I have designed from scratch. I am using Boost library for File mapping.

Update 1:: Was reading through this post in stackoverflow. Got some real good insights, which make me feel that the type of comparisons I have done in the cases may be faulty. A link was posted in the most-voted-for-answer http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html


Solution

  • A red/black tree is more or less equivalent to a 2-3-4 tree, which is just a type of B-tree. The worst-case performance is identical, provided you do a binary search of the B-tree node values.

    The obvious disadvantage of a B-tree is wasted space, but depending on the language/memory allocator used, you may find that a 2-3-4 tree uses less space than a red-black tree on average. For instance, in 32-bit Java, there's approximately an 8-byte overhead per object. (It also depends a lot on the allocator; IIRC phkmalloc rounds up small allocations to a power-of-2 size.)

    To answer your cases,

    1. Disk latency is roughly evenly split between seek time and waiting for the disk to rotate.
    2. A B-tree should be able to outperform a red-black tree if you're doing it right (in particular, a B-tree should be faster if nodes fit into a cacheline.)
    3. It doesn't need to be contiguous in the page file; it merely needs to be contiguous in the process's virtual address space. For sane OSes, it's also pretty much identical to case 1, unless your data is small enough that it mostly fits into memory and the memcpy overhead is significant.

    For simplicity, I'd go with a B-tree and run some benchmarks on various node sizes.