Search code examples
scalagroup-bybloom-filterfoldleft

How to write an efficient groupBy-size filter in Scala, can be approximate


Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:

val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys

will give Set(1,2).

Now suppose the List[Int] is very large. How large it's hard to say but in any case this seems wasteful as I don't care about each of the Ints frequencies, and I only care if they're at least thresh. Once it passed thresh there's no need to check anymore, just add the Int to the Set[Int].

The question is: can I do this more efficiently for a very large List[Int],

a) if I need a true, accurate result (no room for mistakes)

b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int] might include some false-positives, or whether {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0-1].


Solution

  • First of all, you can't do better than O(N), as you need to check each element of your initial array at least once. You current approach is O(N), presuming that operations with IntMap are effectively constant.

    Now what you can try in order to increase efficiency:

    • update map only when current counter value is less or equal to threshold. This will eliminate huge number of most expensive operations — map updates
    • try faster map instead of IntMap. If you know that values of the initial List are in fixed range, you can use Array instead of IntMap (index as the key). Another possible option will be mutable HashMap with sufficient initail capacity. As my benchmark shows it actually makes significant difference
    • As @ixx proposed, after incrementing value in the map, check whether it's equal to 3 and in this case add it immediately to result list. This will save you one linear traversing (appears to be not that significant for large input)

    I don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).

    Update

    I created microbenchmark to measure the actual performance of different implementations. For sufficiently large input and output Ixx's suggestion regarding immediately adding elements to result list doesn't produce significant improvement. However similar approach could be used to eliminate unnecessary Map updates (which appears to be the most expensive operation).

    Results of benchmarks (avg run times on 1000000 elems with pre-warming):

    Authors solution:
    447 ms
    Ixx solution:
    412 ms
    Ixx solution2 (eliminated excessive map writes):
    150 ms
    My solution:
    57 ms
    

    My solution involves using mutable HashMap instead of immutable IntMap and includes all other possible optimizations.

    Ixx's updated solution:

    val tuple = (Map[Int, Int](), List[Int]())
    val res = myList.foldLeft(tuple) {
      case ((m, s), i) =>
        val count = m.getOrElse(i, 0) + 1
        (if (count <= 3) m + (i -> count) else m, if (count == thresh) i :: s else s)
    }
    

    My solution:

    val map = new mutable.HashMap[Int, Int]()
    val res = new ListBuffer[Int]
    myList.foreach {
      i =>
        val c = map.getOrElse(i, 0) + 1
        if (c == thresh) {
          res += i
        }
        if (c <= thresh) {
          map(i) = c
        }
    }
    

    The full microbenchmark source is available here.