Search code examples
javapriority-queuehighest

Java: Collection of N highest elements


I need to find N largest elements in a big collection of data.

I have:

  • A collection of hundreds of millions items in external database (Cassandra)
  • A job that iterates through these items and finds item with largest value

    Item largest = null;
    
    // Page through big data
    List<Item> items = getNextPage(pageSize);
    while (items.size() > 0) {
    
      // Update largest item based on values from current page
      for (Item current : items) {
        if (largest == null || largest.getValue() < current.getValue()) {
          largest = current;
        }
      }
    
      // Move to next page
      items = getNextPage(pageSize);
    }
    

I need:

  • Extend this job to hold N (lets say 100) elements with highest value

My approach:

  • I was thinking about something like priority queue with fixed size

    class PQsort implements Comparator<Item> {
    
      public int compare(Item one, Item two) {
        return two.getValue() - one.getValue();
      }
    }
    
    PriorityQueue<Item> pq = new PriorityQueue<Item>(101, new PQsort());
    
    ...while...for...
    
    pq.offer(current);
    if (pq.size() == 101) {
      // Remove the tail somehow 
    }
    ...
    
  • Removing the tail: Removing tail element of priority queue

What is the optimal solution for this task?


Solution

  • A couple of thoughts on this.

    This task is well suited for using multiple processors. You could split the pages across a pool of threads and then combine the results as they complete.

    It's unnecessary to insert each value, allow the collection to sort and then remove the smallest. Quicker would be to just check if each item is larger than the smallest (i.e. the last) item in the collection.

    Here's a simple example finding the 100 largest integers in an array of 10,000 random integers.

        Queue<Integer> largest = new PriorityQueue<>(100);
    
        for (int item : new Random().ints(10000, 0, 100).toArray()) {
            if (largest.size() < 100 || largest.peek() < item) {
                if (largest.size() == 100)
                    largest.remove();
                largest.add(item);
            }
        }
        System.out.println(largest);