Tags: c#, performance, data-structures, treap

Why is insertion into my tree faster on sorted input than random input?


Now, I've always heard that binary search trees are faster to build from randomly selected data than from ordered data, simply because ordered data requires explicit rebalancing to keep the tree height at a minimum.

Recently I implemented an immutable treap, a special kind of binary search tree which uses random node priorities to keep itself relatively balanced. Contrary to what I expected, I found I can consistently build a treap about 2x faster, and generally better balanced, from ordered data than from unordered data, and I have no idea why.

Here's my treap implementation:
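(The treap implementation itself isn't reproduced here. As a stand-in, here is a minimal sketch of an immutable treap exposing the same constructor and Insert signature the test program below relies on; the node layout, random priorities, and rotation helpers are standard treap machinery and my own assumptions, not the author's actual code.)

using System;

// Minimal immutable treap sketch: a BST on the values and a max-heap
// on per-node random priorities. Insert returns a new treap and shares
// all unmodified subtrees with the old one.
class Treap<T>
{
    static readonly Random PriorityGen = new Random();

    readonly Comparison<T> compare;
    readonly Node root;

    sealed class Node
    {
        public readonly T Value;
        public readonly int Priority;
        public readonly Node Left, Right;

        public Node(T value, int priority, Node left, Node right)
        {
            Value = value; Priority = priority; Left = left; Right = right;
        }
    }

    public Treap(Comparison<T> compare) : this(compare, null) { }

    Treap(Comparison<T> compare, Node root)
    {
        this.compare = compare;
        this.root = root;
    }

    public Treap<T> Insert(T value)
    {
        return new Treap<T>(compare, Insert(root, value, PriorityGen.Next()));
    }

    Node Insert(Node node, T value, int priority)
    {
        if (node == null)
            return new Node(value, priority, null, null);

        if (compare(value, node.Value) < 0)
        {
            Node left = Insert(node.Left, value, priority);
            Node result = new Node(node.Value, node.Priority, left, node.Right);
            // Rebalance: if the new left child outranks its parent,
            // rotate it up to restore the heap property.
            return left.Priority > result.Priority ? RotateRight(result) : result;
        }
        else
        {
            Node right = Insert(node.Right, value, priority);
            Node result = new Node(node.Value, node.Priority, node.Left, right);
            return right.Priority > result.Priority ? RotateLeft(result) : result;
        }
    }

    static Node RotateLeft(Node n)
    {
        // Promote the right child; n becomes its left child.
        Node r = n.Right;
        return new Node(r.Value, r.Priority,
                        new Node(n.Value, n.Priority, n.Left, r.Left),
                        r.Right);
    }

    static Node RotateRight(Node n)
    {
        // Promote the left child; n becomes its right child.
        Node l = n.Left;
        return new Node(l.Value, l.Priority,
                        l.Left,
                        new Node(n.Value, n.Priority, l.Right, n.Right));
    }
}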

And here's a test program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

namespace ConsoleApplication1
{

    class Program
    {
        static Random rnd = new Random();
        const int ITERATION_COUNT = 20;

        static void Main(string[] args)
        {
            List<double> rndTimes = new List<double>();
            List<double> orderedTimes = new List<double>();

            rndTimes.Add(TimeIt(50, RandomInsert));
            rndTimes.Add(TimeIt(100, RandomInsert));
            rndTimes.Add(TimeIt(200, RandomInsert));
            rndTimes.Add(TimeIt(400, RandomInsert));
            rndTimes.Add(TimeIt(800, RandomInsert));
            rndTimes.Add(TimeIt(1000, RandomInsert));
            rndTimes.Add(TimeIt(2000, RandomInsert));
            rndTimes.Add(TimeIt(4000, RandomInsert));
            rndTimes.Add(TimeIt(8000, RandomInsert));
            rndTimes.Add(TimeIt(16000, RandomInsert));
            rndTimes.Add(TimeIt(32000, RandomInsert));
            rndTimes.Add(TimeIt(64000, RandomInsert));
            rndTimes.Add(TimeIt(128000, RandomInsert));
            string rndTimesAsString = string.Join("\n", rndTimes.Select(x => x.ToString()).ToArray());

            orderedTimes.Add(TimeIt(50, OrderedInsert));
            orderedTimes.Add(TimeIt(100, OrderedInsert));
            orderedTimes.Add(TimeIt(200, OrderedInsert));
            orderedTimes.Add(TimeIt(400, OrderedInsert));
            orderedTimes.Add(TimeIt(800, OrderedInsert));
            orderedTimes.Add(TimeIt(1000, OrderedInsert));
            orderedTimes.Add(TimeIt(2000, OrderedInsert));
            orderedTimes.Add(TimeIt(4000, OrderedInsert));
            orderedTimes.Add(TimeIt(8000, OrderedInsert));
            orderedTimes.Add(TimeIt(16000, OrderedInsert));
            orderedTimes.Add(TimeIt(32000, OrderedInsert));
            orderedTimes.Add(TimeIt(64000, OrderedInsert));
            orderedTimes.Add(TimeIt(128000, OrderedInsert));
            string orderedTimesAsString = string.Join("\n", orderedTimes.Select(x => x.ToString()).ToArray());

            Console.WriteLine("Done");
        }

        static double TimeIt(int insertCount, Action<int> f)
        {
            Console.WriteLine("TimeIt({0}, {1})", insertCount, f.Method.Name);

            List<double> times = new List<double>();
            for (int i = 0; i < ITERATION_COUNT; i++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                f(insertCount);
                sw.Stop();
                times.Add(sw.Elapsed.TotalMilliseconds);
            }

            return times.Average();
        }

        static void RandomInsert(int insertCount)
        {
            // Keys are uniformly distributed in [0, 1): unordered input.
            Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
            for (int i = 0; i < insertCount; i++)
            {
                tree = tree.Insert(rnd.NextDouble());
            }
        }

        static void OrderedInsert(int insertCount)
        {
            // i + rnd.NextDouble() lies in [i, i + 1), so successive keys
            // are strictly increasing: this is sorted input.
            Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
            for (int i = 0; i < insertCount; i++)
            {
                tree = tree.Insert(i + rnd.NextDouble());
            }
        }
    }
}

And here's a table comparing average random and ordered insertion times in milliseconds:

Insertions         Random          Ordered         RandomTime / OrderedTime
50                 1.031665        0.261585        3.94
100                0.544345        1.377155        0.4
200                1.268320        0.734570        1.73
400                2.765555        1.639150        1.69
800                6.089700        3.558350        1.71
1000               7.855150        4.704190        1.67
2000               17.852000       12.554065       1.42
4000               40.157340       22.474445       1.79
8000               88.375430       48.364265       1.83
16000              197.524000      109.082200      1.81
32000              459.277050      238.154405      1.93
64000              1055.508875     512.020310      2.06
128000             2481.694230     1107.980425     2.24

I don't see anything in the code which makes ordered input asymptotically faster than unordered input, so I'm at a loss to explain the difference.

Why is it so much faster to build a treap from ordered input than random input?


Solution

  • Self-balancing trees exist to fix the problems associated with non-randomly-distributed data. By definition, they trade away a bit of the best-case performance to vastly improve the worst-case performance associated with non-balanced BSTs, specifically that of sorted input.

    You're actually overthinking this problem, because slower insertion of random data vs. ordered data is a characteristic of any balanced tree. Try it on an AVL tree and you'll see the same results.

    Cameron had the right idea: remove the priority check to force the worst case. If you do that and instrument your tree so you can see how many rebalances happen for each insert (see the sketch at the end of this answer), it becomes very obvious what's going on. When inserting sorted data, the tree always rotates left, and the root's right child is always empty. Insertion always results in exactly one rebalance, because the inserted node has no children and no recursion occurs. On the other hand, when you run it on the random data, almost immediately you start to see multiple rebalances happening on every insert, as many as 5 or 6 of them even in the smallest case (50 inserts), because they happen on subtrees as well.

    With priority checking turned back on, rebalances are not only typically less expensive, because more nodes get pushed into the left subtree (which they never leave, since no insertions happen there), but also less likely to occur in the first place. Why? Because in the treap, high-priority nodes float to the top, and the constant left-rotations (unaccompanied by any right-rotations) push all the high-priority nodes into the left subtree as well. The result is that rebalances happen less frequently, due to the uneven distribution of priorities.

    If you instrument the rebalancing code you'll see that this is true: for both the sorted and the random input, you end up with almost identical numbers of left-rotations, but the random input also produces about the same number of right-rotations, which makes for twice as many rotations in all. This shouldn't be surprising: symmetrically distributed input should produce a symmetric mix of left- and right-rotations. You'll also see that there are only about 60-70% as many top-level rebalances for the sorted input, which perhaps is surprising, and again, that's due to the sorted input skewing the natural distribution of priorities.

    You can also verify this by inspecting the full tree at the end of an insertion loop. With the random input, priorities tend to decrease fairly linearly by level; with the sorted input, priorities tend to stay very high until you get to one or two levels from the bottom.

    Hopefully I've done a decent job explaining this... let me know if any of it is too vague.
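    As a concrete version of the instrumentation described above, here is a sketch of rotation counters bolted onto the hypothetical Treap<T> shown with the question (RotationStats and CountRotations are my own diagnostic additions, not part of either original program):

    // Hypothetical diagnostic counters, reset before each run.
    static class RotationStats
    {
        public static int LeftRotations;
        public static int RightRotations;

        public static void Reset()
        {
            LeftRotations = 0;
            RightRotations = 0;
        }
    }

    // Inside the treap, bump a counter at the top of each rotation helper:
    //
    //     static Node RotateLeft(Node n)
    //     {
    //         RotationStats.LeftRotations++;
    //         ...
    //     }
    //
    // and likewise in RotateRight. Then compare the two workloads:

    static void CountRotations(int insertCount)
    {
        var rnd = new Random();

        var tree = new Treap<double>((x, y) => x.CompareTo(y));
        RotationStats.Reset();
        for (int i = 0; i < insertCount; i++)
            tree = tree.Insert(i + rnd.NextDouble());   // sorted input
        Console.WriteLine("sorted:  L={0} R={1}",
            RotationStats.LeftRotations, RotationStats.RightRotations);

        tree = new Treap<double>((x, y) => x.CompareTo(y));
        RotationStats.Reset();
        for (int i = 0; i < insertCount; i++)
            tree = tree.Insert(rnd.NextDouble());       // random input
        Console.WriteLine("random:  L={0} R={1}",
            RotationStats.LeftRotations, RotationStats.RightRotations);
    }

    If the reasoning above holds, the sorted run should report a right-rotation count of zero, while the random run should split roughly evenly between the two directions.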