Search code examples
c++algorithmdata-structureshashmode

find mode in a rolling window of a long sequence of data with duplicates


Give a sequence of data (with duplicates), move a fix-sized window along the data sequence and find mode in the window at each iteraion, where the oldest data is removed and a new data is inserted to the window.

I cannot find better solutions here.

My idea: Use a hashtable, key is the data, key's data is the frequency of the data occuring in the window.

At the first iteration, iterate each data in the window and put it to the hashtable, meanwhile cout the frequency of each data. After that, traverse the hashtable and return the data with the highest frequency. For each following iteration, search the oldest data in the hashtable and reduce its frequency by 1 if it is in the hahstable if it becoms 0 use new data to replace the old one. Otherwise, just insert the new data into the hahstable. Traverse the table and return the mode.

It is O(n * m) where n is data seq size and m is window size. The drawback is : The hashtable size is not fixed, it may have resize overhead. Each iteration, the table has to be traversed, it is not effcient.

Is it possble to do it with O(n lg m) or O(n) ?

Any help is appreciated.

thanks

Another solution: At the first iteration, build up a hashtable with data as key and its frequency as value associated with the key. On the base of the hashtable, build up a multimap with frequency as key and associated data as value.

After that, at each iteration, in the window, remove the oldest data and update the hashtable, and then update the multimap with the newest updated one in hashtable. If the map key has multiple data, replace it the new one with only the data whose frequency not changed. But, add a new pair with the new frequency and data.

In the window, get a new data and also update the hashtable, update the multimap with the newest updated one in hashtable.

The entry located at the most right hand side on the multimap (a binary search tree) is the mode because its value is the highest frequency in the current window.

Time O(m + m * lg m + n * lg m) if n >> m, O(n lg m). Space : O(m)

Any better idea ?


Solution

  • Space O(M):

    • One ring buffer to hold the M values.
    • One BST holding M {value, PQ pointer} pairs.
    • One Priority Queue holding M Counts.

    Update in time O(lg M):

    • Find the departing value in the ring buffer O(1),
    • Find the same value in the BST O(lg M),
    • Adjust the count in the linked PQ node.
      • Do an adjust Priority on that node O(lg M)
    • Replace the old ring buffer entry with the new one O(1)
    • Find the new value in the BST O(lg M),
    • Adjust the count in the linked PQ node.
      • Do an adjust Priority on that node O(lg M)
    • GetFirst on PQ to find mode O(1)

    You could get rid of the ring buffer by adding a nextItem pointer to the BST structure, and keeping an external pointer to the oldest item. This speeds it up by one BST lookup, and may be a space win if the value size is larger than the pointer size. But the algorithm becomes more complicated to code.