Search code examples
algorithmmonitoringheartbeat

Scalable algorithm to detect stale data


Here is the problem:

"Agents" installed on many different servers send "heartbeat" signals up to a central server every 5 seconds. How can I find the ones that have missed their heartbeat for more than 10 seconds actively and raise an alert?

The problem is simple if you don't think about scalablity. In the simplest form, you can record the timestamp of the latest heartbeat received from each agent in a database table and run a regular query to find the ones older than the threshold.

This solution is however not scalable to millions of agents.

I am looking for algorithms or techologies that make this possible.


Solution

    1. Use a map: AgentId --> LastHearbeatTime
    2. Use 11 sets (assuming a resolution of 1 second is enough), each holds Ids of Agents reported in a 1-second window.

    Every time an agent report a hearthbeat: 1. Find it in the map 2. Delete it from the relevant set 3. Update it in the map 4. Add it to the relevant set

    Define a thread: Once per second, the oldest set expires. It should be empty. If it doesn't - it contains Ids of agents which did not report. Once a set expires, you can reuse it (cyclic array of sets).

    I believe it can be implemented without locks (maybe you'll need 12 sets).