I was working on a LeetCode problem (https://leetcode.com/problems/top-k-frequent-elements/), which is:
Given an integer array nums and an integer k, return the k most frequent elements. You may return the answer in any order.
I solved this using a min-heap (my time complexity calculations are in the comments - please correct me if I made a mistake):
from collections import Counter
from heapq import heapify, heappushpop
from itertools import islice

def top_k_frequent(nums, k):
    if k == len(nums):
        return nums
    # O(N)
    c = Counter(nums)
    # pairs are (frequency, element) so the heap orders by frequency
    it = iter([(x[1], x[0]) for x in c.items()])
    # O(K)
    result = list(islice(it, k))
    heapify(result)
    # O(N-K)
    for elem in it:
        # O(log K)
        heappushpop(result, elem)
    # O(K)
    return [pair[1] for pair in result]

# O(K) + O(N) + O((N - K) log K) + O(K log K)
# if k < N:
# O(N log K)
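Called on a small example it behaves as expected (the function wrapper above is mine, added so the snippet runs standalone):

print(top_k_frequent([1, 1, 1, 2, 2, 3], 2))  # [2, 1] (any order is accepted)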
Then I saw a solution using bucket sort that is supposed to beat the heap solution with O(N):
import collections
import itertools

def top_k_frequent_bucket(nums, k):
    bucket = [[] for _ in nums]
    # O(N)
    c = collections.Counter(nums)
    # O(d) where d is the number of distinct numbers, d <= N
    for num, freq in c.items():
        # an element with frequency freq lands at index len(nums) - freq,
        # so the most frequent elements come first when flattening
        bucket[-freq].append(num)
    # O(?)
    return list(itertools.chain(*bucket))[:k]
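To make the bucket[-freq] indexing concrete, here is a small worked example (my own illustration, not part of the original solution):

import collections
import itertools

nums = [1, 1, 1, 2, 2, 3]
bucket = [[] for _ in nums]                  # 6 empty buckets
for num, freq in collections.Counter(nums).items():
    bucket[-freq].append(num)                # freq 3 -> index -3, and so on
print(bucket)                                # [[], [], [], [1], [2], [3]]
print(list(itertools.chain(*bucket))[:2])    # [1, 2]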
How do we compute the time complexity of the itertools.chain call here? Does it come from the fact that we will chain at most N elements? Is that enough to deduce it is O(N)?
In any case, at least on the LeetCode test cases the first one has better performance - what could be the reason for that?
The time complexity of list(itertools.chain(*bucket)) is O(N), where N is the total number of elements in the nested list bucket. The chain function is roughly equivalent to this:
def chain(*iterables):
    for iterable in iterables:
        for item in iterable:
            yield item
The yield statement dominates the running time, is O(1), and executes once per element, i.e. at most N times. Note that iterating over the sublists themselves also costs O(N), because bucket contains one (mostly empty) sublist per element of nums, so the total is still O(N).
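You can see the flattening behaviour directly (a quick illustration):

from itertools import chain

bucket = [[], [3], [], [1, 2]]
print(list(chain(*bucket)))  # [3, 1, 2] - one yield per element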
The reason your O(N log k) algorithm might end up being faster in practice is that log k is probably not very large; LeetCode says k is at most the number of distinct elements in the array, but I suspect that for most of the test cases k is much smaller, and of course log k is smaller than that.
The O(N) algorithm probably has a relatively high constant factor because it allocates N lists and then randomly accesses them by index, resulting in a lot of cache misses; the append operations may also cause a lot of those lists to be reallocated as they become larger.
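If you want to measure this yourself, here is a rough micro-benchmark sketch (top_k_frequent and top_k_frequent_bucket refer to the two snippets above; the workload shape is my assumption, not LeetCode's actual test data):

import random
import timeit

# Hypothetical workload: many elements, few distinct values, small k
nums = [random.randrange(100) for _ in range(100_000)]
k = 10

heap_time = timeit.timeit(lambda: top_k_frequent(nums, k), number=20)
bucket_time = timeit.timeit(lambda: top_k_frequent_bucket(nums, k), number=20)
print(f"heap:   {heap_time:.3f}s")
print(f"bucket: {bucket_time:.3f}s")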