Search code examples
scalasortinglistbuffer

Scala: Sort a ListBuffer of (Int, Int, Int) by the third value most efficiently


I am looking to sort a ListBuffer(Int, Int, Int) by the third value most efficiently. This is what I currently have. I am using sortBy. Note y and z are both ListBuffer[(Int, Int, Int)], which I am taking the difference first. My goal is to optimize this and do this operation (taking the difference between the two lists and sorting by the third element) most efficiently. I am assuming the diff part cannot be optimized but the sortBy can, so I am looking for a efficient way to do to the sorting part. I found posts on sorting Arrays but I am working with ListBuffer and converting to an Array adds overhead, so I rather not to convert my ListBuffer.

val x = (y diff z).sortBy(i => i._3)

Solution

  • 1) If you want to use Scala libraries then you can't do much better than that. Scala already tries to sort your collection in the most efficient way possible. SeqLike defines def sortBy[B](f: A => B)(implicit ord: Ordering[B]): Repr = sorted(ord on f) which calls this implementation:

      def sorted[B >: A](implicit ord: Ordering[B]): Repr = {
        val len = this.length
        val arr = new ArraySeq[A](len)
        var i = 0
        for (x <- this.seq) {
          arr(i) = x
          i += 1
        }
        java.util.Arrays.sort(arr.array, ord.asInstanceOf[Ordering[Object]])
        val b = newBuilder
        b.sizeHint(len)
        for (x <- arr) b += x
        b.result
      }
    

    This is what your code will be calling. As you can see it already uses arrays to sort data in place. According to the javadoc of public static void sort(Object[] a):

    Implementation note: This implementation is a stable, adaptive, iterative mergesort that requires far fewer than n lg(n) comparisons when the input array is partially sorted, while offering the performance of a traditional mergesort when the input array is randomly ordered. If the input array is nearly sorted, the implementation requires approximately n comparisons.

    2) If you try to optimize by inserting results of your diff directly into a sorted structure like a binary tree as you produce them element by element, you'll still be paying the same price: average cost of insertion is log(n) times n elements = n log(n) - same as any fast sorting algorithm like merge sort.

    3) Thus you can't optimize this case generically unless you optimize to your particular use-case.

    3a) For instance, ListBuffer might be replaced with a Set and diff should be much faster. In fact it's implemented as:

     def diff(that: GenSet[A]): This = this -- that
    

    which uses - which in turn should be faster than diff on Seq which has to build a map first:

    def diff[B >: A](that: GenSeq[B]): Repr = {
        val occ = occCounts(that.seq)
        val b = newBuilder
        for (x <- this)
          if (occ(x) == 0) b += x
          else occ(x) -= 1
        b.result
      }
    

    3b) You can also avoid sorting by using _3 as an index in an array. If you insert using that index your array will be sorted. This will only work if your data is dense enough or you are happy to deal with sparse array afterwards. One index might also have multiple values mapping to it, you'll have to deal with it as well. Effectively you are building a sorted map. You can use a Map for that as well, but a HashMap won't be sorted and a TreeMap will require log(n) for add operation again.

    Consult Scala Collections Performance Characteristics to understand what you can gain based on your case.

    4) Anyhow, sort is really fast on modern computers. Do some benchmarking to make sure you are not prematurely optimizing it.

    To summarize complexity for different scenarios...

    Your current case:

    • diff for SeqLike: n to create a map from that + n to iterate over this (map lookup is effectively constant time (C)) = 2n or O(n)
    • sort - O(n log(n))
    • total = O(n) + O(n log(n)) = O(n log(n)), more precisely: 2n + nlog(n)

    If you use Set instead of SeqLike:

    • diff for Set: n to iterate (lookup is C) = O(n)
    • sort - same
    • total - same: O(n) + O(n log(n)) = O(n log(n)), more precisely: n + nlog(n)

    If you use Set and array to insert:

    • diff - same as for Set
    • sort - 0 - array is sorted by construction
    • total: O(n) + O(0) = O(n), more precisely: n. Might not be very practical for sparse data.

    Looks like in the grand scheme of things it does not matter that much unless you have a unique case that benefits from last option (array).

    If you would have a ListBuffer[Int] rather than ListBuffer[(Int, Int, Int)] I would suggest to sort both collections first and then do a diff by doing a single pass through both of them at the same time. This would be O(nlog(n)). In your case a sort by _3 is not sufficient to guarantee exact order in both collections. You can sort by all three fields of a tuple but that will change the original ordering. If you are fine with that and writing your own diff then it might be the fastest option.