I need to think of a data structure, which supports the following operations efficiently:
1) Add an integer x
2) Delete an integer with maximum frequency (if there are more than one element with the same maximum frequency delete all of them).
I am thinking of implementing a segment tree where each node stores the index of its child having largest frequency.
Any ideas or suggestions on how to approach this problem or how should it be implemented would be kindly appreciated.
We can use a combination of data structures. A hash_map to maintain the frequency mappings, where the key is the integer, and value a pointer to a "frequency" node representing the frequency value and the set of integers having the same frequency. The frequency nodes will be maintained in a list ordered by the values of the frequencies.
The Frequency node can be defined as
class Freq {
int frequency;
Set<Integer> values_with_frequency;
Freq prev;
Freq next;
}
The elements HashMap would then contain entries of the form
Entry<Integer, Freq>
So, for a snapshot of the dataset such as
a,b,c,b,d,d,a,e,a,f,b
where the letters denote integers, the following would be how the data structure would look like.
c -----> (1, [c, e, f])
|
|
e --
|
|
f --
a -----> (3, [a, b])
|
|
b --
d --> (2, [d])
The Freq nodes would be maintained in a linked list, say freq_nodes
, sorted by the frequency value. Note that, as explained below, there wouldn't be any log(n) operation needed for keeping the list sorted on the add/delete operations.
The way the add(x)
, and delete_max_freq()
operations could be implemented is as follows
add(x) :
If x is not found in the elements
map, check if the first element of the freq_nodes
contains the Freq object with frequency 1. If so, add x to the values_with_frequency
set of the Freq object. Otherwise, create a new Freq object with 1 as the frequency value and x added to the (now only single element) wrapped set values_with_frequency
Otherwise, (i.e. if x is already there in the elements
map), follow the pointer in the value of the entry corresponding to x in elements to the Freq object in the freq_nodes
, remove x from the values_with_frequency
field of the Freq object, noting the current value of x’s frequency which is the value of elements.get(x).frequency
(Hold this value in say F). If the set values_with_frequency
is rendered empty due to this removal, delete the corresponding node from the freq_nodes
linked list. Finally if the next Freq node in the freq_nodes
linked list has the frequency F+1, just add x to the values_with_frequency
field of the next node. Otherwise just create a Freq node as was done in the case of non-existence of Freq node with frequency 1 above.
Finally, add the entry (x, Freq)
to the elements
map.
Note that this whole add(x) operation is going to be O(1) in time.
Here's an example of a sequence of add() operations with the subsequent state of the data structure.
add(a)
a -> N1 : freq_nodes : |N1 (1, {a}) | ( N1 is actually a Freq object)
add(b)
a -> N1 : freq_nodes : |N1 (1, {a, b}) |
b -> N1
add(a)
At this point ‘a’ points to N1, however, its current frequency is 2, so we need to insert a node N2 next to N1 in the DLL, after removing it from N1’s values_with_frequency
set {a,b}
a -> N2 : freq_nodes : |N1 (1, {b}) | --> |N2 (2, {a}) |
b -> N1
The interesting thing to note here is that any time we increase the frequency of an existing element from F to say F+1, we need to do the following
if (next node has a higher frequency than F+1 or we have reached the end of the list):
create a new Freq node with frequency equal to F+1 (as is done above)
and insert it next to the current node
else :
add ‘a’ (the input to the add() operation) to the ```values_with_frequency``` set of the next node
The delete_max_freq() operation would just involve removing the last entry of the linked list freq_nodes
, and iterating over the keys in the wrapped set values_with_frequency
to remove the corresponding keys from the elements
map. This operation would take O(k) time where k is the number of elements with maximum frequency.