Given a List[Int]
in Scala, I wish to get the Set[Int]
of all Int
s which appear at least thresh
times. I can do this using groupBy
or foldLeft
, then filter
. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2)
.
Now suppose the List[Int]
is very large. How large it's hard to say but in any case this seems wasteful as I don't care about each of the Int
s frequencies, and I only care if they're at least thresh
. Once it passed thresh
there's no need to check anymore, just add the Int
to the Set[Int]
.
The question is: can I do this more efficiently for a very large List[Int]
,
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int]
might include some false-positives, or whether {the frequency of an Int
> thresh
} isn't really a Boolean
but a Double
in [0-1]
.
First of all, you can't do better than O(N), as you need to check each element of your initial array at least once. You current approach is O(N), presuming that operations with IntMap
are effectively constant.
Now what you can try in order to increase efficiency:
Array
instead of IntMap
(index as the key). Another possible option will be mutable HashMap
with sufficient initail capacity. As my benchmark shows it actually makes significant differenceI don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created microbenchmark to measure the actual performance of different implementations. For sufficiently large input and output Ixx's suggestion regarding immediately adding elements to result list doesn't produce significant improvement. However similar approach could be used to eliminate unnecessary Map updates (which appears to be the most expensive operation).
Results of benchmarks (avg run times on 1000000 elems with pre-warming):
Authors solution:
447 ms
Ixx solution:
412 ms
Ixx solution2 (eliminated excessive map writes):
150 ms
My solution:
57 ms
My solution involves using mutable HashMap
instead of immutable IntMap
and includes all other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
case ((m, s), i) =>
val count = m.getOrElse(i, 0) + 1
(if (count <= 3) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach {
i =>
val c = map.getOrElse(i, 0) + 1
if (c == thresh) {
res += i
}
if (c <= thresh) {
map(i) = c
}
}
The full microbenchmark source is available here.