amazon-web-services, amazon-ecs, nats.io

NATS cluster in AWS - Infinite reconnect attempts when scaling-in


I am trying to set up NATS Server with clustering in AWS ECS with autoscaling.

Background

As a total AWS / NATS newbie, I thought that I could do something very simple. Whenever a new NATS container starts, I use the ECS API and find all the ECS tasks (containers) that are running NATS, get the EC2 IP addresses and mapped ports and pass these via the --routes param. In essence, every node that is already running is a seed.
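
To make the startup step above concrete, here is a minimal sketch of turning discovered endpoints into the value passed to --routes. The ECS lookup itself is left out; the helper names `build_routes_arg` and `build_command` and the sample addresses are hypothetical, and the sketch assumes IPv4 addresses or host names.

```python
from typing import Iterable, List, Tuple


def build_routes_arg(endpoints: Iterable[Tuple[str, int]]) -> str:
    """Join (host, port) pairs into the comma-separated value for --routes."""
    seen = set()
    urls: List[str] = []
    for host, port in endpoints:
        url = f"nats://{host}:{port}"
        # Skip duplicates but keep first-seen order, so a task that is
        # listed twice does not produce two identical routes.
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return ",".join(urls)


def build_command(cluster_url: str, endpoints: Iterable[Tuple[str, int]]) -> List[str]:
    """Build the gnatsd argument list for a newly started node."""
    cmd = ["gnatsd", "--cluster", cluster_url]
    routes = build_routes_arg(endpoints)
    if routes:  # the very first node has no peers to route to
        cmd += ["--routes", routes]
    return cmd
```

Every address passed this way becomes an explicit route on the new node, which is why a peer that later disappears keeps being retried.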

Using this approach, adding nodes is a breeze, but removing seeds is messy.

I noticed that when a node that was passed via --routes dies, the server keeps trying to reconnect to it indefinitely. In a cloud environment, where IP addresses come and go, this is a problem: zombie routes remain forever.

Question

I start server C with routes to seeds IP_A and IP_B. If the host at IP_B dies, C tries to reconnect to IP_B indefinitely. In a cloud environment, a route defined by IP address will never come back once that host is gone.

I feel that NATS, being cloud native, should just accept the fact that the seed is gone.

Is it possible to put a maximum limit on the number of reconnect attempts a server makes to a seed? I checked the code with my limited Go knowledge and couldn't find any indication of this.

Follow up question

To work around this problem, I thought that avoiding IP addresses by combining DNS (via Route 53) with ELB would work, but the simplest setup I could come up with requires three ELB groups.

  • elb-s1: ELB Seed Group 1 (cross-AZ)
  • elb-s2: ELB Seed Group 2 (cross-AZ)
  • elb-normal: ELB Normal Node Group (cross-AZ)

The instances in ELB Seed Group 1 would be started as follows:

gnatsd --cluster nats://elb-s1:6222 --routes nats://elb-s2:6222,nats://elb-normal:6222

The instances in ELB Seed Group 2 would be started as follows:

gnatsd --cluster nats://elb-s2:6222 --routes nats://elb-s1:6222,nats://elb-normal:6222

The instances in ELB Normal Node Group would be started as follows:

gnatsd --cluster nats://elb-normal:6222 --routes nats://elb-s1:6222,nats://elb-s2:6222

NATS clients would connect to: nats://elb-normal:4222.

The reason why each seed group would point to the normal node group is to make sure that non-solicited seeds in a group discover the rest of the mesh with “external” help.
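
The three start commands above follow one pattern: each group clusters on its own ELB name and routes to the other two. A minimal sketch of that pattern, using the group names from the question (the function name `group_commands` is hypothetical):

```python
from typing import Dict, List

CLUSTER_PORT = 6222  # cluster port used in the commands above


def group_commands(groups: List[str], port: int = CLUSTER_PORT) -> Dict[str, str]:
    """Return one gnatsd command per group, routing to every other group."""
    commands = {}
    for name in groups:
        # Each group lists all groups except itself as routes.
        peers = [g for g in groups if g != name]
        routes = ",".join(f"nats://{p}:{port}" for p in peers)
        commands[name] = f"gnatsd --cluster nats://{name}:{port} --routes {routes}"
    return commands
```

Calling `group_commands(["elb-s1", "elb-s2", "elb-normal"])` reproduces the three commands listed above, which makes it visible that the number of routes grows with every group added.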

This feels overly complex. I assume I am not the first person who wants a resilient HA setup in AWS, so I would be grateful for any pointers. Are there any references on the web dealing with NATS clustering in AWS?


Solution

  • Explicit routes are retried forever. Only implicit routes are retried a limited number of times (a limit that we have made configurable in the master branch).

    I understand the problem: if an ephemeral IP is used in -routes and the peer goes away, it may never come back with that IP. But anything you specify in -routes has to be static in some way.

    For instance, you could have 1 or 2 seeds with a known address that is not going to change (use DNS, for instance). Other NATS servers can come and go and will always point to those 2 NATS seeds. When a new server joins, the other NATS servers in the mesh are notified of the addition and connect to it. That connection is considered an implicit route, so if the new server goes away, the reconnection will be tried only once by default, or as many times as configured with connect_retries in the cluster config.
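
As a rough illustration of the seed-based layout described in this answer, the sketch below renders a cluster config block for a non-seed server that lists only the stable seeds as explicit routes and sets connect_retries for implicit ones. The function name `render_cluster_config`, the seed host names, and the listen address are assumptions for illustration; check the exact option syntax against the documentation for your server version, since connect_retries was only in the master branch when this was written.

```python
from typing import List


def render_cluster_config(seed_hosts: List[str], connect_retries: int,
                          cluster_port: int = 6222) -> str:
    """Render a cluster block that routes only to stable seed addresses."""
    if connect_retries < 0:
        raise ValueError("connect_retries must be >= 0")
    lines = [
        "cluster {",
        f"  listen: 0.0.0.0:{cluster_port}",
        # Limit on retries for implicit (discovered) routes.
        f"  connect_retries: {connect_retries}",
        # Explicit routes: only the seeds, which keep a stable DNS name.
        "  routes = [",
    ]
    lines += [f"    nats://{host}:{cluster_port}" for host in seed_hosts]
    lines += ["  ]", "}"]
    return "\n".join(lines) + "\n"


# Hypothetical seed names; in practice these would be stable DNS records.
EXAMPLE = render_cluster_config(
    ["nats-seed-1.example.internal", "nats-seed-2.example.internal"],
    connect_retries=3,
)
```

Because autoscaled servers never appear in anyone's explicit routes in this layout, a scaled-in server is only ever an implicit route for its peers, and reconnect attempts to it stop after the configured limit.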