algorithm binary-tree text-editor red-black-tree

Red-black tree for editing text

I was reading this great article by Joaquin Cuenca Abela. He talks about using a red-black tree to implement a piece table, rather than a doubly linked list.

I'm having some trouble understanding how this might relate to a buffer undergoing changes. For example, take these two buffers (original, append):

Hello!\0    Original
y           Append

And let's say the piece table looks like this:

buffer    start    length
original  0        2
original  5        2
append    0        1

We should end up with:

Hey!\0

Using a doubly linked list, this could be implemented kind of like this:

------------------   -------------------   ----------------
Buffer = ORIGINAL|   |Buffer = ORIGINAL|   |Buffer = APPEND
Start  = 0       |<--|Start  = 5       |<--|Start  = 0
Length = 2       |   |Length = 2       |   |Length = 1
Next   = 0x01    |-->|Next   = 0x02    |-->|Next   = NULL
Previous = NULL  |   |Previous = 0x01  |   |Previous = 0x01
------------------   -------------------   ----------------

If the file is originally loaded as a char array, or something, this seems really trivial and fast to run through after an edit. On the other hand, as far as I understand it, the red-black tree would look like this:

                       ------------------
                       size       = 2
                       size_left  = 1
                       size_right = 2
                       colour     = black
                       ------------------
                        /              \
                       /                \
                      /                  \
             ----------------      ---------------- 
             size       = 1        size       = 2
             size_left  = 0        size_left  = 0
             size_right = 0        size_right = 0
             colour     = red      colour     = red
             ----------------      ----------------
             /          \              /           \
            /            \            /             \
         NULL            NULL       NULL           NULL

I don't see a clear way to repaint the rest of the document after an edit. Sure, insert/delete/lookup would be faster for adding pieces to the tree. But I'm missing how one would construct the edited buffer for viewing.

What am I missing? If I had an editor, and I deleted/inserted a chunk of text, how would I traverse the tree to repaint the buffer and correctly reflect this edit? And, how would this be faster than the O(n) time complexity offered by the linked list?

Solution

I don't really understand the diagram you provide of the tree, because (unlike the linked list diagram) they seem to have no relation to the actual stored data. In practice, they will have essentially the same data fields (Buffer, Start and Length) plus one more, Size, which is the total size of the pieces in the subtree headed by the node. Instead of Previous and Next pointers, they will have Left and Right (child) pointers.

And, of course, they will have some extra data needed in order to maintain the tree balanced (a red/black bit in the case of red/black trees, but I don't think the mechanism for maintaining balance is important here; you could use AVL trees instead of red/black trees, for example. So I'm going to ignore that part of the node here.

The Size field is necessary in order to find the data at a given offset (and consequently, could be left out if there was never any need to do such a lookup.) I think the linked article measures Size in pieces, while I tend to measure Size in characters (or even bytes), which is what I will illustrate here. As the linked article notes, the Size field can easily be maintained in logarithmic time precisely because it refers to the Size of the subtree, rather than its location in the data stream.

You use the Size field to lookup a node by buffer offset. If the offset is less than the Size of the left child, you recurse into the left child; if it is at least the current length plus the Size of the left child, you subtract that sum from the offset and recurse into the right child. Otherwise, the current node contains the desired offset. This cannot take longer than the maximum tree depth, which is O(log N) if the tree is reasonably balanced.

I was also a bit confused by your linked list diagram, which seems to me to be representing the buffer He|!\0|y, while I would expect it to be He|y|!\0:

------------------   -------------------   -------------------
Buffer = ORIGINAL|   |Buffer = APPEND  |   |Buffer = ORIGINAL|
Start  = 0       |<--|Start  = 0       |<--|Start  = 5       |
Length = 2       |   |Length = 1       |   |Length = 2       |
Next   = 0x01    |-->|Next   = 0x02    |-->|Next   = NULL    |
Previous = NULL  |   |Previous = 0x01  |   |Previous = 0x01  |
------------------   -------------------   -------------------

The equivalent balanced tree is:

                       -------------------
                       | Size   = 5      |
                       | Buffer = APPEND |
                       | Start  = 0      |
                       | Length = 1      |
                       -------------------
                        /              \
                       /                \
                      /                  \
             -------------------   ------------------- 
             |Size   = 2       |   |Size   = 2       |
             |Buffer = ORIGINAL|   |Buffer = ORIGINAL|
             |Start  = 0       |   |Start  = 5       |
             |Length = 2       |   |Length = 2       |
             -------------------   -------------------
               /          \            /           \
              /            \          /             \
            NULL            NULL    NULL           NULL

The algorithm for finding the next node in order from a given node is as follows:

While the right child pointer is NULL, return to the parent. Continue to move to the parent until a node is found whose right child pointer is neither NULL nor the child being returned from.
Move to the right child.
While the left child pointer is not NULL, move to the left child

A given application of this algorithm could obviously take O(log N) iterations of steps 1 and/or 3. However, repeated applications (as in the case when you are traversing the buffer in order through a number of nodes) will be linear in total because any given link (parent ⇆ child) will be traversed exactly twice, once in each direction. So the total number of links traversed if walking the entire tree is twice the number of links in the tree. (And the tree has one fewer link than nodes, since it is a tree.)

If you measure size in characters, then you can avoid the need for the "Length" field, because the length of the data directly referred to by a Node is simply the difference between the node's subtree size and the sum of its children subtree sizes. That can (almost) reduce the size of a node to the size of the linked-list node, assuming you can find some way to encode the red/black bit (or other balancing information) in what would otherwise be padding bits.

On the other hand, it is somewhat common to see binary tree implementations with parent pointers, as well as two child pointers. (It's clear how this can help by looking at the traverse algorithm above.) However, it is not necessary to store parent pointers because they can, for example, be maintained during any given traverse of a tree in an array of parent pointers indexed by depth in the tree. This array is obviously no larger than the maximum tree depth, so a small (~50) fixed-length array could be used.

These optimisations are also well beyond this answer.