neo4j docker-compose docker-swarm

Unable to discover other Neo4j causal cluster instances in Docker Swarm


I'm running a slightly modified version of GraphAware's demo docker-compose file (thanks, GraphAware guys).

I got a causal cluster running successfully with docker-compose up, but I can't get the same setup working under Docker Swarm.

The compose file is the same:

version: '3.3'

networks:
  neonet:
    driver: overlay
    attachable: true
    ipam:
      config:
        - subnet: 10.161.0.0/24

services:

  neo-1:
    image: neo4j:3.3.4-enterprise
    networks:
      - neonet
    volumes:
      - /srv/neo4j/neo4j-core1/data:/data
      - /srv/neo4j/neo4j-core1/logs:/logs
    environment:
      - NEO4J_AUTH=neo4j/blah
      - NEO4J_dbms_mode=CORE
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
      - NEO4J_causalClustering_expectedCoreClusterSize=3
      - NEO4J_causalClustering_initialDiscoveryMembers=neo-1:5000,neo-2:5000,neo-3:5000
      - NEO4J_dbms_connector_http_listen__address=:7474
      - NEO4J_dbms_connector_https_listen__address=:6477
      - NEO4J_dbms_connector_bolt_listen__address=:7687

  neo-2:
    image: neo4j:3.3.4-enterprise
    networks:
      - neonet
    volumes:
      - /srv/neo4j/neo4j-core2/data:/data
      - /srv/neo4j/neo4j-core2/logs:/logs
    environment:
      - NEO4J_AUTH=neo4j/blah
      - NEO4J_dbms_mode=CORE
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
      - NEO4J_causalClustering_expectedCoreClusterSize=3
      - NEO4J_causalClustering_initialDiscoveryMembers=neo-1:5000,neo-2:5000,neo-3:5000
      - NEO4J_dbms_connector_http_listen__address=:7474
      - NEO4J_dbms_connector_https_listen__address=:6477
      - NEO4J_dbms_connector_bolt_listen__address=:7687

  neo-3:
    image: neo4j:3.3.4-enterprise
    networks:
      - neonet
    volumes:
      - /srv/neo4j/neo4j-core3/data:/data
      - /srv/neo4j/neo4j-core3/logs:/logs
    environment:
      - NEO4J_AUTH=neo4j/blah
      - NEO4J_dbms_mode=CORE
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
      - NEO4J_causalClustering_expectedCoreClusterSize=3
      - NEO4J_causalClustering_initialDiscoveryMembers=neo-1:5000,neo-2:5000,neo-3:5000
      - NEO4J_dbms_connector_http_listen__address=:7474
      - NEO4J_dbms_connector_https_listen__address=:6477
      - NEO4J_dbms_connector_bolt_listen__address=:7687

...except that for docker-compose up I specify neither the overlay network details nor any deploy settings. Both clusters run on a single machine.

If I shell into a container started by standalone docker-compose, the IP address looks fine and port 5000 is reachable with curl; doing the same (curl ip:5000) in the swarm-deployed container results in connection refused.

Running netstat -ntlp gives:

/var/lib/neo4j # netstat -ntlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.161.0.166:5000       0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.11:44137        0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN      -

This shows port 5000 listening on an IP address (10.161.0.166) that does not belong to any interface on this machine, as ifconfig shows:

eth0      Link encap:Ethernet  HWaddr 02:42:0A:A1:00:A7
          inet addr:10.161.0.167  Bcast:10.161.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:119 errors:0 dropped:0 overruns:0 frame:0
          TX packets:119 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:7110 (6.9 KiB)  TX bytes:7110 (6.9 KiB)

eth1      Link encap:Ethernet  HWaddr 02:42:AC:12:00:06
          inet addr:172.18.0.6  Bcast:172.18.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:648 (648.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:58 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:3604 (3.5 KiB)  TX bytes:3604 (3.5 KiB)

As you can see, there are two interfaces besides loopback: my neonet network and (I assume) Docker's ingress.

Furthermore, the generated Neo4j config tells it to listen for discovery on all interfaces:

causal_clustering.transaction_listen_address=0.0.0.0:6000
causal_clustering.transaction_advertised_address=2a9e1683a92e:6000
causal_clustering.raft_listen_address=0.0.0.0:7000
causal_clustering.raft_advertised_address=2a9e1683a92e:7000
causal_clustering.initial_discovery_members=neo1:5000,neo2:5000,neo3:5000
causal_clustering.expected_core_cluster_size=3
causal_clustering.discovery_listen_address=0.0.0.0:5000
causal_clustering.discovery_advertised_address=2a9e1683a92e:5000
EDITION=enterprise
ACCEPT.LICENSE.AGREEMENT=yes

...but it somehow decides to listen on one specific IP. Incidentally, it does this for port 5000 but not for port 7000.

I'm no networking fundi, but it doesn't look right to listen on an IP that is bound to no interface on this machine.

How do I instruct Neo4j to bind to all interfaces, or at least to a valid one?


Solution

  • It turned out that several fixes were needed, the core one being to set deploy.endpoint_mode: dnsrr, which prevents Docker from creating a virtual IP for the service. My working swarm file is shown below.

    "Working" here means a multi-node Neo4j causal cluster of cores only, working 100% with a Neo4j OGM v3 client using the connection URL bolt+routing://neo-1:7687. I haven't been brave enough yet to test failover of the initial connection, so neo-1 is (initially) a single point of failure.

    version: '3.3'
    
    services:
      neo-1:
        image: neo4j:3.3.4-enterprise
        volumes:
          - neo-data:/data
          - neo-logs:/var/lib/neo4j/logs
        environment:
          - NEO4J_AUTH=neo4j/blah
          - NEO4J_causalClustering_discoveryAdvertisedAddress=neo-1:5000
          - NEO4J_causalClustering_transactionAdvertisedAddress=neo-1:6000
          - NEO4J_causalClustering_raftAdvertisedAddress=neo-1:7000
          - NEO4J_causalClustering_expectedCoreClusterSize=3
          - NEO4J_causalClustering_initialDiscoveryMembers=neo-1:5000,neo-2:5000,neo-3:5000
          - NEO4J_dbms_connectors_default__advertised__address=neo-1
          - NEO4J_dbms_connector_bolt_advertised__address=:7687
          - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
          - NEO4J_dbms_mode=CORE
    
        deploy:
          mode: global
          endpoint_mode: dnsrr
          placement:
            constraints:
              - node.labels.neodb == 1
        networks:
          - neonet
    
      neo-2:
        image: neo4j:3.3.4-enterprise
        volumes:
          - neo-data:/data
          - neo-logs:/var/lib/neo4j/logs
        environment:
          - NEO4J_AUTH=neo4j/blah
          - NEO4J_causalClustering_discoveryAdvertisedAddress=neo-2:5000
          - NEO4J_causalClustering_transactionAdvertisedAddress=neo-2:6000
          - NEO4J_causalClustering_raftAdvertisedAddress=neo-2:7000
          - NEO4J_causalClustering_expectedCoreClusterSize=3
          - NEO4J_causalClustering_initialDiscoveryMembers=neo-1:5000,neo-2:5000,neo-3:5000
          - NEO4J_dbms_connectors_default__advertised__address=neo-2
          - NEO4J_dbms_connector_bolt_advertised__address=:7687
          - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
          - NEO4J_dbms_mode=CORE
    
        deploy:
          mode: global
          endpoint_mode: dnsrr
          placement:
            constraints:
              - node.labels.neodb == 2
        networks:
          - neonet
    
      neo-3:
        image: neo4j:3.3.4-enterprise
        volumes:
          - neo-data:/data
          - neo-logs:/var/lib/neo4j/logs
        environment:
          - NEO4J_AUTH=neo4j/blah
          - NEO4J_causalClustering_discoveryAdvertisedAddress=neo-3:5000
          - NEO4J_causalClustering_transactionAdvertisedAddress=neo-3:6000
          - NEO4J_causalClustering_raftAdvertisedAddress=neo-3:7000
          - NEO4J_causalClustering_expectedCoreClusterSize=3
          - NEO4J_causalClustering_initialDiscoveryMembers=neo-1:5000,neo-2:5000,neo-3:5000
          - NEO4J_dbms_connectors_default__advertised__address=neo-3
          - NEO4J_dbms_connector_bolt_advertised__address=:7687
          - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
          - NEO4J_dbms_mode=CORE
    
        deploy:
          mode: global
          endpoint_mode: dnsrr
          placement:
            constraints:
              - node.labels.neodb == 3
        networks:
          - neonet
    
    networks:
      neonet:
        driver: overlay
    
    volumes:
      neo-data:
      neo-logs:
    

    I'm pretty sure this is too verbose, and by now there's probably a solution that needs only one service (with multiple replicas) to be declared.
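
Short of a single-service solution, one way to trim the repetition in the swarm file above is to move the shared settings into YAML anchors. The following is only a sketch, not something tested against this cluster: the x-neo-env and x-neo-core names are made up for illustration, and it assumes a Docker version that accepts compose file format 3.4 extension fields (top-level keys starting with x-) and YAML merge keys. Note that the environment has to be written as a mapping rather than a list for the merge to work, so values such as "yes" need quoting.

```yaml
# Sketch only: the x-neo-* names are hypothetical, and this assumes compose
# file format 3.4+ (extension fields) plus YAML merge-key support.
version: '3.4'

# Environment shared by every core (values copied from the file above).
x-neo-env: &neo-env
  NEO4J_AUTH: "neo4j/blah"
  NEO4J_ACCEPT_LICENSE_AGREEMENT: "yes"
  NEO4J_dbms_mode: "CORE"
  NEO4J_causalClustering_expectedCoreClusterSize: "3"
  NEO4J_causalClustering_initialDiscoveryMembers: "neo-1:5000,neo-2:5000,neo-3:5000"
  NEO4J_dbms_connector_bolt_advertised__address: ":7687"

# Service settings shared by every core.
x-neo-core: &neo-core
  image: neo4j:3.3.4-enterprise
  volumes:
    - neo-data:/data
    - neo-logs:/var/lib/neo4j/logs
  networks:
    - neonet

services:
  neo-1:
    <<: *neo-core
    environment:
      <<: *neo-env
      # Only the per-node advertised addresses differ between services.
      NEO4J_causalClustering_discoveryAdvertisedAddress: "neo-1:5000"
      NEO4J_causalClustering_transactionAdvertisedAddress: "neo-1:6000"
      NEO4J_causalClustering_raftAdvertisedAddress: "neo-1:7000"
      NEO4J_dbms_connectors_default__advertised__address: "neo-1"
    deploy:
      mode: global
      endpoint_mode: dnsrr
      placement:
        constraints:
          - node.labels.neodb == 1

  # neo-2 and neo-3 repeat the neo-1 block with "neo-1" replaced by their own
  # name and the constraint changed to neodb == 2 and neodb == 3 respectively.

networks:
  neonet:
    driver: overlay

volumes:
  neo-data:
  neo-logs:
```

Either version of the file relies on the swarm nodes carrying the neodb label used in the placement constraints; such a label can be set with docker node update --label-add neodb=1 <node-name> before deploying the stack.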