Tags: algorithm, performance, data-structures, io, duplicates

Check for duplicate input items in a data-intensive application


I have to build a server-side application that will receive a stream of data as input; it will actually receive a stream of integers up to nine decimal digits, and it has to write each of them to a log file. The input data is totally random, and one of the requirements is that the application must not write duplicate items to the log file and should periodically report the number of duplicate items found.

Taking into account that performance is a critical aspect of this application, as it should be able to handle high loads of work (and parallel work), I would like to find a proper solution to keep track of the duplicate entries, since checking the whole log (text) file on every write is certainly not a suitable option. I can think of a solution consisting of maintaining some sort of data structure in memory to keep track of the whole stream of data processed so far, but as the input volume can be really high, I don't think that is the best way to do it either...

Any ideas?


Solution

  • Assuming the stream of random integers is uniformly distributed, the most efficient way to keep track of duplicates is to maintain a bitmap with one bit per possible value. For non-negative nine-digit integers that is about 10^9 bits, i.e. roughly 120 MiB of RAM (twice that if negative values must be handled as well). Since this data structure is large, memory accesses may be slow (limited by the latency of the memory hierarchy).

    If the ordering does not matter, you can use multiple threads to mitigate the impact of the memory latency. Parallel accesses can be done safely using atomic logical operations. To check whether a value has already been seen, you test its bit in the bitmap and then set it (atomically if done in parallel).
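
    Here is a minimal sequential sketch of the bitmap approach in Python, assuming non-negative nine-digit integers; `input_stream` and `log` are hypothetical stand-ins for the real input source and log file. A truly parallel version would need an atomic OR on the underlying words, which plain Python does not expose:

    NUM_VALUES = 10**9                            # Non-negative nine-digit integers
    seen_bits = bytearray(NUM_VALUES // 8)        # ~120 MiB, one bit per possible value
    duplicates = 0

    for value in input_stream:
        byte_index, bit_mask = value >> 3, 1 << (value & 7)
        if seen_bits[byte_index] & bit_mask:      # Bit already set -> duplicate
            duplicates += 1
        else:
            seen_bits[byte_index] |= bit_mask     # Mark the value as seen
            log.write(f"{value}\n")               # First occurrence: write it to the log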

    If you know that your stream contains fewer than about a million integers, or the stream of random integers is not uniformly distributed, you can use a hash-set data structure instead, since it stores the data more compactly (at least for sequential processing).

    Bloom filters can help you speed up the filtering when the number of values in the stream is quite big and there are very few duplicates (this method has to be combined with an exact approach if you want deterministic results).
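
    A rough Bloom-filter sketch is shown below; the filter size `BITS`, the number of hash functions `NUM_HASHES` and the use of `blake2b` are illustrative assumptions, not part of the original answer. Because a Bloom filter can report false positives, a positive hit still has to be confirmed against an exact structure (such as the bitmap or a hash set) if deterministic results are required:

    import hashlib

    BITS = 1 << 27                                # 128 Mi bits -> 16 MiB filter (illustrative size)
    NUM_HASHES = 4                                # Illustrative number of hash functions
    bloom = bytearray(BITS // 8)

    def bloom_check_and_add(value):
        """Return False if value is definitely new, True if it *may* be a duplicate."""
        digest = hashlib.blake2b(str(value).encode(), digest_size=4 * NUM_HASHES).digest()
        positions = [int.from_bytes(digest[4 * i:4 * i + 4], "little") % BITS
                     for i in range(NUM_HASHES)]
        maybe_duplicate = all(bloom[p >> 3] & (1 << (p & 7)) for p in positions)
        for p in positions:
            bloom[p >> 3] |= 1 << (p & 7)         # Set the bits for this value
        return maybe_duplicate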

    Here is an example using hash-sets in Python:

    seen = set()                     # Values seen so far (not only the duplicates)
    duplicates = 0                   # Count of duplicate items, for periodic reporting
    for value in input_stream:       # Iterate over the stream values
        if value not in seen:        # O(1) average-case lookup
            log.write(f"{value}\n")  # First occurrence: write it to the log
            seen.add(value)          # O(1) average-case insertion
        else:
            duplicates += 1          # Duplicate: do not write it again