
Bluemix Scalable Container Group auto recovery option


How does the auto-recovery option on a container in a scalable group work?

I have enabled it (by using --auto, and the web UI shows Autorecovery: On), but it did not try to restart the container when it crashed this morning. The container in the group died at 2015-09-29T05:51:27.187Z and was manually restarted over an hour later, at 2015-09-29T07:35:33.561Z. Restarting the container "solves" the runtime problem (a bug that is being fixed) until a user does the same thing in the app again and crashes it.

According to the docs:

To start a new container when one of the containers in the group crashes or becomes unavailable, select the Enable autorecovery option. If you do not select this option, a new instance is not started automatically.

Listed in known problems:

Auto-recovery is not immediate

Auto-recovery for container groups might take more than 15 minutes for new systems to come online. Wait for auto-recovery to become available, which can take more than 15 minutes.


Solution

  • For every container in the group, the service will run a curl request against the port that you specified when you created the group.

    If a container does not respond for whatever reason, the service assumes the container needs to be replaced. So it will destroy that container and create a new one in its place.

    The fine print

    1. The containers need to be running a service that responds to HTTP requests on a particular port.
    2. The port that you expose when you create the container group must be the same as the port in #1.
    3. The port in #1/#2 must respond to HTTP requests, not HTTPS requests. The route for the group (e.g. https://example.mybluemix.net) is secure, and internal traffic from the route to the containers is also encrypted, so the containers in the group do not need to listen on HTTPS.
    4. The service checks every container in the group once every 2 minutes or so.
    5. If the service has to replace every instance in the group more than 3 times within roughly a 10-minute period, it stops tearing down and recovering instances in the group from that point forward. On the Bluemix site, you might see the Autorecovery label switch from On to Off. This prevents a never-ending loop of teardowns and replacements of containers that are either always crashing or consistently non-responsive.
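
To illustrate points 1 to 3 of the fine print, here is a minimal sketch of a process that answers plain HTTP requests on the exposed port, so that the periodic check has something to get a response from. It assumes Python 3 is available in the container image and that 8080 is the port exposed when the group was created; the names HealthHandler and make_server and the "ok" response body are purely illustrative and not part of the container service.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: this must match the port exposed when the container group was created.
PORT = 8080


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with 200 so a plain-HTTP probe sees a live container."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress the default per-request logging to stderr.
        pass


def make_server(host="0.0.0.0", port=PORT):
    """Build a plain HTTP (not HTTPS) server bound to the given host and port."""
    return HTTPServer((host, port), HealthHandler)


# To run inside the container:
#     make_server().serve_forever()
```

If the application already serves HTTP on the exposed port, nothing extra should be needed. A sketch like this would only matter for a container whose main process does not otherwise speak HTTP on that port.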