windows · docker · docker-compose · freeze · fluentd

Docker containers become unresponsive/hang on error


I'm running Docker Desktop on Windows and am having a problem with containers becoming unresponsive on startup errors. This doesn't happen 'every' time, but it does by far most of the time. Consequently, I have to be very careful to start my containers one at a time, and if I see a single error, I have to "Restart Docker Desktop" and start the process again.

I'm using docker-compose and, as a specific example, this morning I started elasticsearch, zookeeper, then kafka. Kafka threw an exception regarding the ZooKeeper state and shut down - but now the kafka container is unresponsive in Docker. I can't stop it (it's already stopped?), yet it shows as running. I can't open a CLI into it, and I can't restart it. The only way forward is to restart Docker using the debug menu. (If I have the restart: always flag on, the containers will restart automatically, but given they're throwing errors, they just spin in circles, starting then dying, without my being able to stop/kill/remove the offending container.) Once I've restarted Docker, I'm able to view the container's log and see the error that was thrown...

This happens with pretty much all of my containers; however, it does appear that if I start a container whilst viewing its log window in Docker Desktop, it is perhaps 'more likely' that I'll be able to start it again after an error.

I've tried several different containers and this seems to be a pretty common issue for us. It doesn't appear to relate to any specific settings I'm passing into the containers; an extract from our docker-compose file is below:

volumes:
   zData:
   kData:
   eData:

services:
   zookeeper:
      container_name: zookeeper
      image: bitnami/zookeeper:latest
      environment:
         ALLOW_ANONYMOUS_LOGIN: "yes"           #Dev only
         ZOOKEEPER_ROOT_LOGGER: WARN, CONSOLE
         ZOOKEEPER_CONSOLE_THRESHOLD: WARN
      ports:
         - "2181:2181"
      volumes:
         - zData:/bitnami/zookeeper:rw
      logging:
         driver: "fluentd"
         options:
            fluentd-address: localhost:24224
            tag: zookeeper
            fluentd-async-connect: "true"
         
   kafka:
      container_name: kafka
      image: bitnami/kafka:latest
      depends_on:
         - zookeeper
      environment:
         ALLOW_PLAINTEXT_LISTENER: "yes"  # Debug only
         KAFKA_ADVERTISED_PORT: 9092
         KAFKA_ADVERTISED_HOST_NAME: kafka
         KAFKA_CREATE_TOPICS: xx1_event:1:1,xx2_event:1:1,xx3_event:1:1,xx4_event:1:1
         KAFKA_JMX_OPTS: -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=${DOCKER_HOSTNAME} -Dcom.sun.management.jmxremote.rmi.port=9096 -Djava.net.preferIPv4Stack=true
         JMX_PORT: 9096
         KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      hostname: kafka
      ports:
         - 9092:9092
         - 9096:9096
      volumes:
         - kData:/bitnami/kafka:rw
      logging:
         driver: "fluentd"
         options:
            fluentd-address: localhost:24224
            tag: kafka
            fluentd-async-connect: "true"

   elasticsearch:
      image: bitnami/elasticsearch:latest
      container_name: elasticsearch
      cpu_shares: 2048
      environment:
         ELASTICSEARCH_HEAP_SIZE: "2048m"
         xpack.monitoring.enabled: "false"
      ports:
         - 9200:9200
         - 9300:9300
      volumes:
        - C:/config/elasticsearch.yml:/opt/bitnami/elasticsearch/config/my_elasticsearch.yml:rw
        - eData:/bitnami/elasticsearch/data:rw

I've wondered about the potential for this to be a resourcing issue; however, I'm running this on a reasonably spec'd laptop (i7, SSD, 16GB RAM) using WSL 2 (it also happens when using Hyper-V), and the RAM limits don't look like they're being approached. When there are no errors on startup, the system runs fine and uses far more resources.

Any ideas on what I could try? I'm surprised there aren't many more people struggling with this.


Solution

  • There is currently a known issue (https://github.com/moby/moby/issues/40063) where containers hang/freeze/become unresponsive when the logging driver is set to fluentd in asynchronous mode AND the fluentd container is not operational.
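  • If fluentd logging is still wanted while that issue is open, one possible workaround is sketched below. It assumes a fluentd service (named fluentd here, listening on 24224) runs in the same compose file; the other services are made to depend on it and the asynchronous connect is turned off, so a missing collector fails the container start instead of hanging it:

    services:
       fluentd:
          image: fluent/fluentd:latest         # assumption: any image exposing the forward input on 24224
          ports:
             - "24224:24224"

       kafka:
          image: bitnami/kafka:latest
          depends_on:
             - zookeeper
             - fluentd                          # start the collector before kafka (started, not necessarily ready)
          logging:
             driver: "fluentd"
             options:
                fluentd-address: localhost:24224
                tag: kafka
                fluentd-async-connect: "false"  # synchronous connect: fail fast rather than hang if fluentd is down

    Alternatively, switching these services back to the default json-file logging driver while debugging avoids the hang entirely.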