Search code examples
algorithmperformanceoptimizationdata-structuresmemory-efficient

Which data structure supports given operations efficiently


I need to think of a data structure, which supports the following operations efficiently:
1) Add an integer x
2) Delete an integer with maximum frequency (if there are more than one element with the same maximum frequency delete all of them).
I am thinking of implementing a segment tree where each node stores the index of its child having largest frequency.
Any ideas or suggestions on how to approach this problem or how should it be implemented would be kindly appreciated.


Solution

  • We can use a combination of data structures. A hash_map to maintain the frequency mappings, where the key is the integer, and value a pointer to a "frequency" node representing the frequency value and the set of integers having the same frequency. The frequency nodes will be maintained in a list ordered by the values of the frequencies.

    The Frequency node can be defined as

    class Freq {
       int frequency;
       Set<Integer> values_with_frequency;
       Freq prev;
       Freq next;
    }
    

    The elements HashMap would then contain entries of the form

    Entry<Integer, Freq>

    So, for a snapshot of the dataset such as a,b,c,b,d,d,a,e,a,f,b where the letters denote integers, the following would be how the data structure would look like.

    c -----> (1, [c, e, f])
        |
        |
    e --
        |
        |
    f --
    
    a -----> (3, [a, b])
        |
        |
    b --
    
    d --> (2, [d])
    
    

    The Freq nodes would be maintained in a linked list, say freq_nodes, sorted by the frequency value. Note that, as explained below, there wouldn't be any log(n) operation needed for keeping the list sorted on the add/delete operations.

    The way the add(x), and delete_max_freq() operations could be implemented is as follows

    add(x) : If x is not found in the elements map, check if the first element of the freq_nodes contains the Freq object with frequency 1. If so, add x to the values_with_frequency set of the Freq object. Otherwise, create a new Freq object with 1 as the frequency value and x added to the (now only single element) wrapped set values_with_frequency

    Otherwise, (i.e. if x is already there in the elements map), follow the pointer in the value of the entry corresponding to x in elements to the Freq object in the freq_nodes, remove x from the values_with_frequency field of the Freq object, noting the current value of x’s frequency which is the value of elements.get(x).frequency(Hold this value in say F). If the set values_with_frequency is rendered empty due to this removal, delete the corresponding node from the freq_nodes linked list. Finally if the next Freq node in the freq_nodes linked list has the frequency F+1, just add x to the values_with_frequency field of the next node. Otherwise just create a Freq node as was done in the case of non-existence of Freq node with frequency 1 above.

    Finally, add the entry (x, Freq) to the elements map. Note that this whole add(x) operation is going to be O(1) in time.

    Here's an example of a sequence of add() operations with the subsequent state of the data structure.

    add(a)

    a -> N1 :       freq_nodes :   |N1 (1,  {a}) |   ( N1 is actually a Freq object)
    
    

    add(b)

    a -> N1 :        freq_nodes :   |N1 (1,  {a, b}) | 
    b -> N1
    
    

    add(a) At this point ‘a’ points to N1, however, its current frequency is 2, so we need to insert a node N2 next to N1 in the DLL, after removing it from N1’s values_with_frequency set {a,b}

    a -> N2 :       freq_nodes :   |N1 (1,  {b}) |  --> |N2 (2,  {a}) | 
    b -> N1
    

    The interesting thing to note here is that any time we increase the frequency of an existing element from F to say F+1, we need to do the following

    if (next node has a higher frequency than F+1 or we have reached the end of the list):
    
         create a new Freq node with frequency equal to F+1 (as is done above) 
         and insert it next to the current node
    else :
        add ‘a’ (the input to the add() operation) to the ```values_with_frequency``` set of the next node
    

    The delete_max_freq() operation would just involve removing the last entry of the linked list freq_nodes, and iterating over the keys in the wrapped set values_with_frequency to remove the corresponding keys from the elements map. This operation would take O(k) time where k is the number of elements with maximum frequency.