Search code examples
algorithmtrend

Real-time trending algorithm


I'm developing a system that has to return the most trending 'articles' in real time based on the no. of hits that article has.

My first thought was storing for each article the no. of hits vs. time. Then I would normalize this function, and calculate its first derivative which will return the growth rate. Then with the second derivative I'd be able to know how much it's growing and if it reaches a certain threshold -> tag it as trending.

The problem is: I can do that "offline" for example at the end of the day, but I don't know how to do it continously...

I know that there exist such things as Storm but I'm looking for something as specific as this (a written algorithm in pseudocode or an article approaching this problem and not the generic one).


Solution

  • A simple way to keep track of the hottest articles would be to use an LRU-cache. This efficiently keeps track of its most recently accessed elements. Frequent accesses will keep hot articles in the LRU while infrequently accessed articles drop out. You could modify the LRU to keep track of the number of hits an article has received if you want to refine the ranking.

    Sure, there are more complicated approaches, but this is easy to implement, has nice computational properties, and will likely provide a similar response.

    You could also try using a circular buffer for each article. Once the buffer fills, each access to the article can be used to update a variable indicating the oldest hit on the article. Since you know the newest and oldest hit, you can estimate the hits per time unit. Downside: lots of memory usage, inexact statistics for frequently-accessed articles.

    Similarly, you could associate each article with a linked list containing <Time,Count> pairs. Every time a hit happens, add to the count at the top of the linked list if less than a second (or what have you) has passed since Time. If too much time has passed, add a new pair at the front of the list. You can then treat the linked list as a data series for calculating derivatives. Curtail elements which are too old when you walk the list.