Search code examples
javaarraysalgorithmsortingdata-structures

Best way to retrieve K largest elements from large unsorted arrays?


I recently had a coding test during an interview. I was told:

There is a large unsorted array of one million ints. User wants to retrieve K largest elements. What algorithm would you implement?

During this, I was strongly hinted that I needed to sort the array.

So, I suggested to use built-in sort() or maybe a custom implementation if performance really mattered. I was then told that using a Collection or array to store the k largest and for-loop it is possible to achieve approximately O(N), in hindsight, I think it's O(N*k) because each iteration needs to compare to the K sized array to find the smallest element to replace, while the need to sort the array would cause the code to be at least O(N log N).

I then reviewed this link on SO that suggests priority queue of K numbers, removing the smallest number every time a larger element is found, which would also give O(N log N). Write a program to find 100 largest numbers out of an array of 1 billion numbers

Is the for-loop method bad? How should I justify pros/cons of using the for-loop or the priorityqueue/sorting methods? I'm thinking that if the array is already sorted, it could help by not needing to iterate through the whole array again, i.e. if some other method of retrieval is called on the sorted array, it should be constant time. Is there some performance factor when running the actual code that I didn't consider when theorizing pseudocode?


Solution

  • Another way of solving this is using Quickselect. This should give you a total average time complexity of O(n). Consider this:

    1. Find the kth largest number x using Quickselect (O(n))
    2. Iterate through the array again (or just through the right-side partition) (O(n)) and save all elements ≥ x
    3. Return your saved elements

    (If there are repeated elements, you can avoid them by keeping count of how many duplicates of x you need to add to the result.)

    The difference between your problem and the one in the SO question you linked to is that you have only one million elements, so they can definitely be kept in memory to allow normal use of Quickselect.