Docker service stuck in New state (Swarm)


I'm facing a strange issue with my Docker Swarm (a cluster of 3 managers and 5 workers). Many services are already running, and once I reach around 100 services (more than 110 tasks when replicas are counted), new services no longer start.

When I list the service's tasks, I see this:

ID            NAME            IMAGE       NODE  DESIRED STATE  CURRENT STATE     ERROR  PORTS
alam7whfn1xe  service_name.1  some_image        Running        New 22 hours ago

Note that CURRENT STATE is New 22 hours ago. The logs are empty, and inspecting the service shows nothing relevant either.

If I stop some of my services, the service stuck in the New state may start by itself on the next retry. It looks like I have hit some kind of limit.

I went through some documentation on the web and found nothing clear about this issue. Any pointers to relevant links would be welcome.

My current suspicion is that the networks I created in the Swarm (--driver=overlay) have too small an IP range and cannot hand out enough IP addresses to containers. These networks are /24 subnets. Is there any way to "flush" the IP reservations and re-initialize the networks without recreating them?
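To put rough numbers on the /24 suspicion, here is a minimal sketch in POSIX shell arithmetic. It assumes that the network address, the broadcast address and one gateway address are unavailable, and that every service takes one virtual IP while every task takes one more address. The helper names allocatable_ips and required_ips are made up for this illustration, and the counts (100 services, 110 tasks) are the approximate figures from the question, treated as if they all shared one network.

```shell
# Addresses left in an IPv4 subnet of the given prefix length after removing
# the network address, the broadcast address and one gateway address
# (assumed reservations, not taken from Docker's source).
allocatable_ips() {
    echo $(( (1 << (32 - $1)) - 3 ))
}

# Assumed demand: one virtual IP per service plus one address per task.
required_ips() {
    echo $(( $1 + $2 ))
}

capacity=$(allocatable_ips 24)
demand=$(required_ips 100 110)
echo "capacity=$capacity demand=$demand free=$(( capacity - demand ))"
```

Under these assumptions a single /24 has little headroom left at that scale, so any extra reservations or further services could plausibly exhaust it.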

After investigation, two types of services end up in this New state, and both are attached to the same 2 networks.

The result of docker network inspect:

[
    {
        "Name": "network_name",
        "Id": "okbrl5twyheq32ht3zw5l00gs",
        "Created": "0001-01-01T00:00:00Z", <- this is the real date, strange isn't it?
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.16.2.0/24",
                    "Gateway": "172.16.2.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
         "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4097"
        },
        "Labels": null
    }
]

Additionally, the output of docker version:

Client:
 Version:      17.06.2-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   cec0b72
 Built:        Tue Sep  5 20:00:06 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.2-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   cec0b72
 Built:        Tue Sep  5 19:58:57 2017
 OS/Arch:      linux/amd64
 Experimental: false

N.B.: I don't want to update Docker at the moment.

EDIT 1:

I reread the Docker documentation about networks, and it mentions an open issue on Moby's GitHub project: Swarm Mode at Scale #30820.

Overlay network limitations

You should create overlay networks with /24 blocks (the default), which limits you to 256 IP addresses, when you create networks using the default VIP-based endpoint-mode. This recommendation addresses limitations with swarm mode. If you need more than 256 IP addresses, do not increase the IP block size. You can either use dnsrr endpoint mode with an external load balancer, or use multiple smaller overlay networks. See Configure service discovery for more information about different endpoint modes.

-- https://docs.docker.com/engine/reference/commandline/network_create/#overlay-network-limitations

EDIT 2:

Based on Flavio 'fcrisciani' Crisciani's comment on the issue Swarm Mode at Scale #30820, I'll try adding the option --endpoint-mode=dnsrr to my services.
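A small sketch of what that change could look like for existing services. The function only prints the docker service update commands so they can be reviewed first; dnsrr_update_cmds is a hypothetical helper name and service_name is the placeholder used earlier in this post.

```shell
# Print (do not run) one update command per service name passed in.
# --endpoint-mode is a documented flag of `docker service update`;
# the helper name and the service names are placeholders.
dnsrr_update_cmds() {
    for svc in "$@"; do
        printf 'docker service update --endpoint-mode dnsrr %s\n' "$svc"
    done
}

dnsrr_update_cmds service_name
```

Once the output looks right it can be piped into sh. As far as I know, dnsrr cannot be combined with ports published through the routing mesh, so services that publish ports may need host-mode publishing or an external load balancer, as the documentation quoted above suggests.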


Solution

  • Each service and each task gets an IP address, so the overlay network that the services connect to needs a subnet large enough to supply all of those addresses.

    Use the following command to create a Docker network with a larger range of available IPs:

    docker network create --driver=overlay --subnet=10.10.0.0/16 <network_name>
    

    References:
    https://github.com/docker/for-aws/issues/104#issuecomment-331563445
    https://docs.docker.com/engine/reference/commandline/network_create/
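Since the subnet of an existing network cannot simply be widened, one possible way to apply this is to create a second, larger network and move services onto it one at a time. The sketch below only prints the commands. migration_cmds and network_name_wide are hypothetical names, and it assumes the --network-add and --network-rm flags of docker service update are available on the engine in use (the CLI reference lists them for API 1.29 and later).

```shell
# Print the commands that create a wider overlay network and move one
# service from the old network to the new one. All names are placeholders.
migration_cmds() {
    old_net=$1
    new_net=$2
    subnet=$3
    svc=$4
    printf 'docker network create --driver=overlay --subnet=%s %s\n' "$subnet" "$new_net"
    printf 'docker service update --network-add %s --network-rm %s %s\n' "$new_net" "$old_net" "$svc"
}

migration_cmds network_name network_name_wide 10.10.0.0/16 service_name
```

Printing first makes it easy to check each line before running it. Note that the Docker documentation quoted in EDIT 1 advises against larger blocks with VIP-based endpoints, so this is worth testing on a single service before moving the rest.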