I need to find the N largest elements in a big collection of data.
I have:
A job that iterates through these items and finds the item with the largest value:
Item largest = null;
// Page through the big data set
List<Item> items = getNextPage(pageSize);
while (items.size() > 0) {
    // Update the largest item based on the values from the current page
    for (Item current : items) {
        if (largest == null || largest.getValue() < current.getValue()) {
            largest = current;
        }
    }
    // Move to the next page
    items = getNextPage(pageSize);
}
I need:
To find the N largest items, not just the single largest one.
My approach:
I was thinking about something like a priority queue with a fixed size:
class PQsort implements Comparator<Item> {
    public int compare(Item one, Item two) {
        // Compare without subtraction to avoid integer overflow
        return Integer.compare(two.getValue(), one.getValue());
    }
}
PriorityQueue<Item> pq = new PriorityQueue<Item>(101, new PQsort());
// ...inside the while/for loops from above...
pq.offer(current);
if (pq.size() == 101) {
    // Remove the tail somehow
}
// ...
Removing the tail: see "Removing tail element of priority queue".
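For reference, here is one way my sketch could be completed. Using a min-ordered queue of capacity N (instead of the descending comparator above) puts the smallest of the kept items at the head, so poll() removes the tail of the result directly. This is a self-contained sketch using ints in place of Item:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TopNHeap {
    // Keep the N largest values seen so far in a min-ordered heap:
    // the head is the smallest of the current top N, so it is the
    // one to evict when the heap grows past N.
    static List<Integer> topN(int[] values, int n) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(n);
        for (int v : values) {
            pq.offer(v);
            if (pq.size() > n) {
                pq.poll(); // drop the smallest, keeping only the top N
            }
        }
        return new ArrayList<>(pq);
    }

    public static void main(String[] args) {
        System.out.println(topN(new int[] {5, 1, 9, 3, 7}, 3));
    }
}
```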
What is the optimal solution for this task?
A couple of thoughts on this.
This task is well suited for using multiple processors. You could split the pages across a pool of threads and then combine the results as they complete.
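As a rough sketch of that idea: here each worker computes its own bounded top-N over one slice of an array (standing in for a batch of pages), and the partial heaps are merged at the end. The ExecutorService setup and chunking are illustrative choices, not part of the question's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTopN {
    // Each task keeps its own bounded min-heap over one slice of the data;
    // the per-slice results are then merged through one final bounded heap.
    static List<Integer> topN(int[] data, int n, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<PriorityQueue<Integer>>> parts = new ArrayList<>();
            int chunk = (data.length + threads - 1) / threads;
            for (int t = 0; t < threads; t++) {
                int from = t * chunk;
                int to = Math.min(data.length, from + chunk);
                parts.add(pool.submit(() -> {
                    PriorityQueue<Integer> local = new PriorityQueue<>(n);
                    for (int i = from; i < to; i++) {
                        if (local.size() < n || local.peek() < data[i]) {
                            if (local.size() == n) {
                                local.poll(); // evict the smallest of the kept n
                            }
                            local.offer(data[i]);
                        }
                    }
                    return local;
                }));
            }
            // Merge the per-slice results with the same bounded-heap trick
            PriorityQueue<Integer> merged = new PriorityQueue<>(n);
            for (Future<PriorityQueue<Integer>> f : parts) {
                for (int v : f.get()) {
                    if (merged.size() < n || merged.peek() < v) {
                        if (merged.size() == n) {
                            merged.poll();
                        }
                        merged.offer(v);
                    }
                }
            }
            return new ArrayList<>(merged);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```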
It's unnecessary to insert each value, let the collection sort itself, and then remove the smallest. Quicker is to first check whether each item is larger than the smallest item in the collection (the head of a min-ordered PriorityQueue), and only insert it if so.
Here's a simple example finding the 100 largest integers in an array of 10,000 random integers.
Queue<Integer> largest = new PriorityQueue<>(100);
for (int item : new Random().ints(10000, 0, 100).toArray()) {
    if (largest.size() < 100 || largest.peek() < item) {
        if (largest.size() == 100) {
            largest.remove();
        }
        largest.add(item);
    }
}
System.out.println(largest);
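One caveat: a PriorityQueue's iteration order (and hence its toString output) is not sorted; only the head is guaranteed to be the minimum. If you need the final top N in order, poll the queue empty, which yields the kept items smallest-first. A small sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class DrainHeap {
    // poll() always removes the current minimum, so repeatedly polling
    // yields the queue's contents in ascending order.
    static List<Integer> drainAscending(PriorityQueue<Integer> pq) {
        List<Integer> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            out.add(pq.poll());
        }
        return out;
    }
}
```

Reversing that list then gives the results largest-first.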