I know of the following algorithm for heavy-hitters:
Algorithm findHeavyHitters(epsilon, inputStream)
    integer k = ceiling(1 / epsilon) - 1
    initialize hashmap H of size k
    while an item i from the input stream arrives:
        if H[i] exists
            increment the value associated with H[i]
        elseif number of items in H < k
            put i into H with a value of 1
        elseif there exists an entry j with a value of 0
            remove j and put i into H with a value of 1
        else
            decrement all values in H by 1
    endwhile
    return H
Correct me if I'm wrong, but this algorithm does not run in O(n). Is it possible to modify this algorithm so that it runs in O(n) while maintaining the O(1/epsilon) use of space?
For a stream of t items, the point of the algorithm is to return the items that occur at least epsilon*t times. Epsilon is given as a fraction (e.g. 0.1 for items that occur at least 10% of the time).
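For concreteness, here is the algorithm as I would transcribe it into Python (the names are mine). Both the search for a zero-valued entry and the blanket decrement walk all of H, which is why I doubt the O(n) bound:

import math

def find_heavy_hitters(epsilon, input_stream):
    k = math.ceil(1 / epsilon) - 1
    H = {}
    for i in input_stream:
        if i in H:
            H[i] += 1
        elif len(H) < k:
            H[i] = 1
        else:
            # O(k) scan for an entry whose value has dropped to 0
            victim = next((j for j, c in H.items() if c == 0), None)
            if victim is not None:
                del H[victim]
                H[i] = 1
            else:
                for j in H:          # O(k) decrement of every value
                    H[j] -= 1
    return H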
The algorithm can indeed run in average time O(n), on the basis that a hash lookup is, on average, O(1). But there are two implementation details to get right. First, the last step, which seems to involve touching every value in H: to make it O(1), we add one additional storage location, called base, which is initialized to 0. Then we modify the algorithm as follows:
while an item i from the input stream arrives:
    if H[i] exists
        increment the value associated with H[i]
    elseif number of items in H < k
        put i into H with a value of base + 1
    elseif there exists an entry j with a value of base
        remove j and put i into H with a value of base + 1
    else
        increment base
endwhile
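In Python, that modified loop might look like this; note that the search for a base-valued entry is still a linear scan here, and making that lookup O(1) is the second issue, handled below:

import math

def find_heavy_hitters(epsilon, input_stream):
    k = math.ceil(1 / epsilon) - 1
    H = {}
    base = 0                  # stands in for "decrement everything by 1"
    for i in input_stream:
        if i in H:
            H[i] += 1
        elif len(H) < k:
            H[i] = base + 1
        else:
            victim = next((j for j, c in H.items() if c == base), None)
            if victim is not None:
                del H[victim]
                H[i] = base + 1
            else:
                base += 1     # O(1) replacement for the decrement loop
    # an entry's effective count is its stored count minus base
    return {j: c - base for j, c in H.items()}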
The second issue is finding an entry with the value base (the logical 0) in O(1). This can be done by keeping the entries in a "comb": a linked list of doubly-linked lists. Each inner list contains the entries with a specific count. The outer list holds the inner lists in increasing count order, with the head pointing to the list with the smallest count. If you draw this data structure, it looks like a comb:
[ base ] -> entry a -> entry b -> entry c
    |
[ base + i ] -> entry d
    |
[ base + j ] -> entry e -> entry f
    |
etc.
The hash table now points at the entries rather than containing them. To increment the count of a single entry, unlink it from its inner list (O(1) in a doubly-linked list) and then either insert it into the next list, if that list's count is exactly one greater, or splice in a new one-element list right after the list it came from; a list left empty is unlinked from the comb. Either way the operation is O(1).
The comb data structure is still O(k), where k is the number of entries in the hash table, because there cannot be more distinct counts than entries.
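Here is one way the comb might look in Python, as a sketch. To keep it short, each inner list is represented by a plain set (removing a known element from a set is O(1) on average, which is all the inner doubly-linked list is used for); the outer list is a real doubly-linked list of count buckets:

class Bucket:
    """One tooth of the comb: all entries sharing a single count."""
    def __init__(self, count):
        self.count = count
        self.items = set()
        self.prev = None
        self.next = None

class Comb:
    def __init__(self):
        self.head = None         # bucket with the smallest count
        self.bucket_of = {}      # key -> the bucket currently holding it

    def _insert_after(self, b, new):
        new.prev, new.next = b, b.next
        if b.next:
            b.next.prev = new
        b.next = new

    def _unlink(self, b):
        if b.prev:
            b.prev.next = b.next
        else:
            self.head = b.next
        if b.next:
            b.next.prev = b.prev

    def add(self, key, count):
        # In this algorithm new keys always arrive with count == base + 1,
        # so the right bucket is the head, just after it, or a new front bucket.
        if self.head is None or self.head.count > count:
            b = Bucket(count)
            b.next = self.head
            if self.head:
                self.head.prev = b
            self.head = b
        elif self.head.count == count:
            b = self.head
        elif self.head.next and self.head.next.count == count:
            b = self.head.next
        else:
            b = Bucket(count)
            self._insert_after(self.head, b)
        b.items.add(key)
        self.bucket_of[key] = b

    def bump(self, key):
        # Move key from its bucket to the bucket for count + 1, in O(1).
        b = self.bucket_of[key]
        b.items.discard(key)
        if b.next and b.next.count == b.count + 1:
            target = b.next
        else:
            target = Bucket(b.count + 1)
            self._insert_after(b, target)
        target.items.add(key)
        self.bucket_of[key] = target
        if not b.items:
            self._unlink(b)

    def pop_min_if(self, count):
        # If the smallest count present is exactly `count` (i.e. base),
        # remove and return one such key; otherwise return None.
        if self.head is not None and self.head.count == count:
            key = self.head.items.pop()
            del self.bucket_of[key]
            if not self.head.items:
                self._unlink(self.head)
            return key
        return None

The main loop then calls bump(i) to increment an existing entry, pop_min_if(base) when it needs to evict, and add(i, base + 1) for insertions; every call is O(1) on average.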
Instead of doubly-linked lists, you can use a simple array plus a list of the indices of the first entry with each count. To move an entry to the next count bucket, you first swap it with the last entry having that count, and then either advance the beginning of the next count's list or insert a new count-list entry, depending on whether the next list's count is exactly one greater or more than one greater. The swap requires updating the positions of the two swapped entries in the hash table, but that is still O(1).
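A sketch of that variant follows. One liberty taken: the array is kept in non-increasing count order, so that fresh insertions and evictions of base-valued entries both happen at the tail in O(1); the swap then happens at the first entry of a count group instead of the last, but it is the same boundary trick:

import math

class MisraGries:
    """Heavy hitters with the base offset and the flat-array comb.
    Every per-item operation is O(1) on average; space is O(1/epsilon)."""

    def __init__(self, epsilon):
        self.k = math.ceil(1 / epsilon) - 1
        self.base = 0            # global offset standing in for "decrement all"
        self.keys = []           # entry keys, kept in non-increasing count order
        self.counts = []         # counts[i] is the stored count of keys[i]
        self.index = {}          # key -> its position in keys/counts
        self.group_start = {}    # count -> leftmost array index with that count

    def _bump(self, i):
        # Raise the entry at index i from count c to c + 1 in O(1): swap it
        # with the first entry of its count group, then fix the boundaries.
        c = self.counts[i]
        j = self.group_start[c]
        self.keys[i], self.keys[j] = self.keys[j], self.keys[i]
        self.index[self.keys[i]] = i
        self.index[self.keys[j]] = j
        if j + 1 < len(self.counts) and self.counts[j + 1] == c:
            self.group_start[c] = j + 1        # group of count c shrinks
        else:
            del self.group_start[c]            # group of count c is now empty
        self.counts[j] = c + 1
        self.group_start.setdefault(c + 1, j)  # extend or create the c+1 group

    def offer(self, item):
        if item in self.index:                 # known item: increment it
            self._bump(self.index[item])
        elif len(self.keys) < self.k:          # room left: append at the tail
            # All stored counts are >= base + 1 here, so order is preserved.
            self.keys.append(item)
            self.counts.append(self.base + 1)
            self.index[item] = len(self.keys) - 1
            self.group_start.setdefault(self.base + 1, len(self.keys) - 1)
        elif self.base in self.group_start:    # evict a base-valued entry
            p = len(self.keys) - 1             # the base group sits at the tail
            del self.index[self.keys[p]]
            self.keys[p] = item
            self.index[item] = p
            self._bump(p)                      # base -> base + 1
        else:                                  # "decrement everything" in O(1)
            self.base += 1

    def result(self):
        # The effective count of a survivor is its stored count minus base.
        return {key: c - self.base for key, c in zip(self.keys, self.counts)}

Used like so: with epsilon = 0.25 we get k = 3 counters, and any item occurring in more than a quarter of the stream is guaranteed to be among the survivors.

mg = MisraGries(0.25)
for item in "abacabadabacaba":
    mg.offer(item)
print(mg.result())   # candidates with undercount estimates; 'a' must survive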