Load balancing calls to external system in akka-cluster

Akka cluster diagram:

          DATA CENTER 1                            DATA CENTER 2
+-------------------------------+        +-------------------------------+
| +--------+         +--------+ |        | +--------+         +--------+ |
| | NODE 1 |---------| NODE 2 | |        | | NODE 4 |---------| NODE 5 | |
| +--------+         +--------+ |        | +--------+         +--------+ |
|     |                  |      +--------+     |                  |      |
|     |    +--------+    |      |        |     |    +--------+    |      |
|     +----| NODE 3 |----+      |        |     +----| NODE 6 |----+      |
|          +--------+           |        |          +--------+           |
+-------------------------------+        +-------------------------------+

Requirements:

This cluster needs to provide health monitoring for a variety of services
Each node has access to the same list of health check URLs that need to be called periodically
Distribution of health check calls should be balanced across nodes
Don't want multiple nodes to call the same health check URL
If a node goes down, its health check calls should be redistributed to nodes that are up

Thoughts:

Set up a cluster singleton to centralize the responsibility of external calls. However, in a multi-datacenter setup, there would be two singletons (one per datacenter), so how to distribute calls between them?
Is there a better alternative to the cluster singleton? According to the docs: "Using a singleton should not be the first design choice. It has several drawbacks, such as single-point of bottleneck."

So how would you design an Akka cluster system to load balance calls to external systems?

Solution

It sounds like you just want the health checks roughly evenly spread out. You could use Akka's cluster sharding with remember entities and have the number of shards equal to the number of health checks with a single entity per shard. Then when each instance starts up it simply sends a message to all shards to ensure that they are started up. If they are already started they will ignore the message. I'm assuming here that you have significantly more than 6 health checks to be executed.

If an instance goes down, the shard is automatically assigned to a different instance of the app and the entity is started again on that new instance. For aggregating the health checks you could send them all to a Kafka topic or send them to a cluster singleton for aggregation. Maybe you plan on sending the results to a statsd end point or insert them into a database. It wasn't clear how the health check results would be used from your question.