Search code examples
arraysperformancecomplexity-theorytrie

Expected performance of tries vs bucket arrays with constant load-factor


I know that I can simply use bucket array for associative container if I have uniformly distributed integer keys or keys that can be mapped into uniformly distributed integers. If I can create the array big enough to ensure a certain load factor (which assumes the collection is not too dynamic), than the expected number of collisions for a key will be bounded, because this is simply hash table with identity hash function.

Edit: I view strings as equivalent to positional fractions in the range [0..1]. So they can be mapped into any integer range by multiplication and taking floor of the result.

I can also do prefix queries efficiently, just like with tries. I presume (without knowing a proof) that the expected number of empty slots corresponding to a given prefix that have to be skipped sequentially before the first bucket with at least one element is reached is also going to be bounded by constant (again depending on the chosen load factor).

And of course, I can do stabbing queries in worst-case constant time, and range queries in solely output sensitive linear expected time (if the conjecture of denseness from the previous paragraph is indeed true).

What are the advantages of a tries then?

If the distribution is uniform, I don't see anything that tries do better. But I may be wrong.

If the distribution has large uncompensated skew (because we had no prior probabilities or just looking at the worst case), the bucket array performs poorly, but tries also become heavily imbalanced, and can have linear worst case performance with strings of arbitrary length. So the use of either structure for your data is questionable.

So my question is - what are the performance advantages of tries over bucket arrays that can be formally demonstrated? What kind of distributions elicit those advantages?

I was thinking of distributions with self-similar structure at different scales. I believe those are called fractal distributions, of which I confess to know nothing. May be then, if the distribution is prone to clustering at every scale, tries can provide superior performance, by keeping the load factor of each node similar, adding levels at dense regions as necessary - something that bucket arrays can not do.

Thanks


Solution

  • Tries are good if your strings share common prefixes. In that case, the prefix is stored only once and can be queried with linear performance in the output string length. In a bucket array, all strings with the same prefixes would end up close together in your key space, so you have very skewed load where most buckets are empty and some are huge.

    More generally, tries are also good if particular patterns (e.g. the letters t and h together) occur often. If there are many such patterns, the order of the trie's tree nodes will typically be small, and little storage is wasted.