Search code examples
algorithmredisdistributed-computingword-countsliding-window

how to implement a "trending counter" to count words with a sliding window?


I have multiple processes counting words. For example, I have multiple kafka consumers, consuming different partitions of a single kafka topic. Each message on the topic is a single word (string). the word is both the key and and the value for my kafka topic. A consumer will consume a message and increase the counter for that word by 1.

I want to be able to query for the 10 most popular words in the past 5 minutes. Once a word is no longer in the current window, I do not want to count it. Let's assume we're using processing time as our timestamp.

What is the best way to do that?

language-agnostic


Solution

  • RedisBloom has a Top-K solution that might fit your use case in case your stream has a large number of different words to count.

    In this blog post (by me :) you can see Top-K is preferable to sort sets both in execution time and in memory requirements. In my benchamrk, for k=10, memory consumption was less than 10kb vs ~6mb for a sorted set. The dataset was the book War and Peace which has total of ~500,000 and about 41,000 different words.

    My recommandation is to keep several Top-K keys and retire them after 5 minutes. The number of keys depends of the resolution you want.

    Another nice feature of Top-K is that you get elements as they are being expelled from the Top-K list. This allow you to track trends. This feature is not available with sort sets.