I'm facing a strange issue with my Docker Swarm (a cluster of 3 managers and 5 workers). I have many services running, and when I approach around 100 services (more than 110 tasks once replicas are counted), the new services I want to run won't start.
When I list the tasks of such a service, I get this:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
alam7whfn1xe service_name.1 some_image Running New 22 hours ago
You can see CURRENT STATE == New 22 hours ago. If I try to inspect the logs, they're empty. Inspecting the service won't help either (nothing relevant).
If I stop some of my services, the service stuck in the New state may start by itself after the first retry. It seems that I have reached some kind of limit.
I went through some documentation on the web, but found nothing clear about this issue. Any pointers to relevant links would be welcome.
My current suspicion is that the networks I created in the Swarm (--driver=overlay) have an insufficient IP range and can't hand out enough IP addresses to containers. These networks are /24 subnets. Is there any way to "flush" the IP reservations in order to re-initialize the networks without recreating the Docker networks?
After investigation, there are two types of services that end up in this New state, and they're on the same 2 networks.
The result of docker network inspect:
[
{
"Name": "network_name",
"Id": "okbrl5twyheq32ht3zw5l00gs",
"Created": "0001-01-01T00:00:00Z", <- this is the real date, strange isn't it?
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.16.2.0/24",
"Gateway": "172.16.2.1"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": null,
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4097"
},
"Labels": null
}
]
Additionally, docker version:
Client:
Version: 17.06.2-ce
API version: 1.30
Go version: go1.8.3
Git commit: cec0b72
Built: Tue Sep 5 20:00:06 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.2-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: cec0b72
Built: Tue Sep 5 19:58:57 2017
OS/Arch: linux/amd64
Experimental: false
N.B.: I don't want to update Docker at the moment.
EDIT 1:
I read the Docker documentation about networks again, and it mentions an open issue on Moby's GitHub project: Swarm Mode at Scale #30820.
Overlay network limitations
You should create overlay networks with /24 blocks (the default), which limits you to 256 IP addresses, when you create networks using the default VIP-based endpoint-mode. This recommendation addresses limitations with swarm mode. If you need more than 256 IP addresses, do not increase the IP block size. You can either use dnsrr endpoint mode with an external load balancer, or use multiple smaller overlay networks. See Configure service discovery for more information about different endpoint modes.
-- https://docs.docker.com/engine/reference/commandline/network_create/#overlay-network-limitations
EDIT 2:
Based on Flavio 'fcrisciani' Crisciani's comment on the issue Swarm Mode at Scale #30820, I'll try adding the option --endpoint-mode=dnsrr to my services.
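A minimal sketch of what that could look like, reusing the placeholder names from the output above (service_name, network_name, some_image). The helper function names and the DOCKER variable are only illustrative, and the script prints the commands by default instead of running them. As far as I understand, dnsrr cannot be combined with ports published through the ingress routing mesh, so this assumes the services do not publish ports that way.

```shell
#!/usr/bin/env bash
# Dry run by default: the commands are printed, not executed.
# Set DOCKER=docker to actually run them against the swarm.
DOCKER="${DOCKER:-echo docker}"

# Create a new service that uses DNS round-robin instead of a virtual IP.
# Arguments (all placeholders): service name, overlay network, image.
create_dnsrr_service() {
  local name=$1 network=$2 image=$3
  $DOCKER service create --name "$name" --network "$network" \
    --endpoint-mode dnsrr "$image"
}

# Switch an existing service from the default VIP mode to DNS round-robin.
update_to_dnsrr() {
  local name=$1
  $DOCKER service update --endpoint-mode dnsrr "$name"
}

create_dnsrr_service service_name network_name some_image
update_to_dnsrr service_name
```

With dnsrr the service no longer gets a virtual IP of its own, so only its tasks should consume addresses on the overlay network.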
Each service and each task gets an IP address, so the overlay network that the services are connected to should have a subnet large enough to provide that many IP addresses.
Use the following command to create a Docker network with a larger range of supported IPs:
docker network create --driver=overlay --subnet=10.10.0.0/16 <network_name>
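As a rough sanity check on subnet sizing, the arithmetic can be sketched as below. This is only an estimate: it assumes the default VIP endpoint mode, counts one address per service plus one per task, reserves three addresses (network, broadcast, gateway), and ignores any other allocations the engine may make. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Rough capacity estimate for an overlay subnet (not an exact accounting).

# Usable addresses in an IPv4 subnet with the given prefix length,
# excluding the network address, the broadcast address and the gateway.
usable_ips() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 3 ))
}

# Estimated addresses consumed on one network in VIP endpoint mode:
# one virtual IP per service plus one IP per task (replica).
needed_ips() {
  local services=$1 tasks=$2
  echo $(( services + tasks ))
}

echo "/24 usable: $(usable_ips 24)"
echo "/16 usable: $(usable_ips 16)"
echo "needed for 100 services and 110 tasks: $(needed_ips 100 110)"
```

Under these assumptions a /24 offers 253 usable addresses and a /16 offers 65533, while 100 services with 110 tasks attached to a single network would need about 210, which leaves little headroom on a /24.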
References:
https://github.com/docker/for-aws/issues/104#issuecomment-331563445
https://docs.docker.com/engine/reference/commandline/network_create/