Search code examples
scalaperformance

Scala Set intersection performance


Using Scala's scala.collection.Set[T]. Given a small set s with only a few elements and another big set b with lots of elements, is there any performance difference between:

s & b // s intersect b

and

b & s // b intersect s.

If yes, which is the fastest?


Solution

  • The answer is: it's complicated.

    The default implementation of an immutable set is scala.collection.immutable.Set. This has special cases for sizes 1 to 4. As soon as you have more than 4 elements, it will use scala.collection.immutable.HashSet.

    Very small s (1..4)

    So let's say you have a large set b and a small set s, with s containing <4 elements.

    Then it will make a large difference:

    b & s will filter all elements of b against membership in s and therefore takes b.count * s.count equality comparisons. Since b is large this will be quite slow.

    s & b will filter the few elements of s against a membership in b, which is s.length times a hashing and an equality comparison if the hashes match (remember b is a HashSet). Since is is small this should be very fast.

    Small s (n>4)

    As soon as s is larger than 4 elements, it also will be a HashSet. Intersection for HashSets is written in a symmetric and very efficient way. It will combine the tree structures of s and b and perform equality comparisons when the hashes match. It will use maximum structural sharing. E.g. if b contains all elements of s, the result will be the same instance as s, so no objects will be allocated.

    General advice

    If you just write generic code where you don't care much about high performance, it is fine to use the default implementations such as scala.collection.Set. However, if you care about performance characteristics it is preferable to use a concrete implementation. E.g. if s and b are declared as scala.collection.immutable.HashSet, you will have consistent high performance independent of order, provided that you have a good hash function.