What I have:
I run Celery with RabbitMQ as the broker and Redis as the result backend. I have an app that sends tasks and workers that process them.
I deployed it as follows:
First, if you have any recommendations on this architecture, I would be glad to hear them.
The problem:
Everything works well: the tasks are dispatched to the different workers, which send back the results, and so on.
However, when a task is sent after about 30 minutes with no task running, and that task is not routed to the 'local_worker', I observe a 2-minute latency before the result comes back.
It is as if Redis or the VM were sleeping for 2 minutes before returning the result. As it is exactly 120 seconds (2 minutes), I suspect it is something deliberate in Redis, Celery or Azure (something deterministic).
I don't use a Redis conf file; the server runs with default settings (except for the password).
Thanks for your help and feedback on my architecture and problems.
Here is a screenshot of what I see in Flower. The three tasks are the same (removing a directory).
The first and the third tasks were processed by the local worker. The second one was processed by an external worker. In the logs of the external worker, I put a print line just before returning the result, and this line was printed at 14:14:23. So there were 120 seconds between this print and the official end of the task.
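For reference, the task itself is simple. It looks roughly like this (a simplified sketch; the task name and the path argument are placeholders, not my real code):

import shutil
from app.worker.celery_app import celery_app

@celery_app.task
def remove_directory(path):
    # remove the directory tree, then log just before returning,
    # like the print I added in the external worker
    shutil.rmtree(path, ignore_errors=True)
    print('removed', path)
    return path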
EDIT:
I found that the default value for redis_socket_timeout was 120 seconds.
I removed the line redis_retry_on_timeout = True and added the line redis_socket_keepalive = True in my Celery config file. Now the error I get is that the task fails with redis.exceptions.TimeoutError: Timeout reading from socket.
I don't know why the socket times out even though the result is ready. Is it a problem with the network of my container instance?
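For context, these are the Redis backend settings I have been experimenting with in the Celery config (a sketch of what I tried, not exactly what I kept; as far as I can tell, 120 is the documented default for redis_socket_timeout):

# Redis result backend socket settings (Celery lowercase setting names)
redis_socket_timeout = 120       # default is 120 seconds, which matches the 2-minute gap I see
redis_socket_keepalive = True    # what I added
redis_retry_on_timeout = True    # what I removed; with it, the timed-out read seems to be retried instead of raising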
Here is my docker-compose:
version: "3.5"
services:
rabbitmq:
image: rabbitmq:3.8-management
restart: always
ports:
- 5672:5672
labels:
- traefik.enable=true
- traefik.http.services.rabbitmq-ui.loadbalancer.server.port=15672
- traefik.http.routers.rabbitmq-ui-http.entrypoints=http
- traefik.http.routers.rabbitmq-ui-http.rule=(Host(`rabbitmq.${HOSTNAME?Variable not set}.example.app`))
- traefik.docker.network=traefik-public
- traefik.http.routers.rabbitmq-ui-https.entrypoints=https
- traefik.http.routers.rabbitmq-ui-https.rule=Host(`rabbitmq.${HOSTNAME?Variable not set}.example.app`)
- traefik.http.routers.rabbitmq-ui-https.tls=true
- traefik.http.routers.rabbitmq-ui-https.tls.certresolver=le
- traefik.http.routers.rabbitmq-ui-http.middlewares=https-redirect
env_file:
- .env
environment:
- RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
- RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
networks:
- traefik-public
redis:
image: redis:6.2.5
restart: always
command: ["redis-server", "--requirepass", "${RABBITMQ_DEFAULT_PASS:-password}"]
ports:
- 6379:6379
networks:
- traefik-public
flower:
image: mher/flower:0.9.5
restart: always
labels:
- traefik.enable=true
- traefik.http.services.flower-ui.loadbalancer.server.port=5555
- traefik.http.routers.flower-ui-http.entrypoints=http
- traefik.http.routers.flower-ui-http.rule=Host(`flower.${HOSTNAME?Variable not set}.example.app`)
- traefik.docker.network=traefik-public
- traefik.http.routers.flower-ui-https.entrypoints=https
- traefik.http.routers.flower-ui-https.rule=Host(`flower.${HOSTNAME?Variable not set}.example.app`)
- traefik.http.routers.flower-ui-https.tls=true
- traefik.http.routers.flower-ui-https.tls.certresolver=le
- traefik.http.routers.flower-ui-http.middlewares=https-redirect
- traefik.http.routers.flower-ui-https.middlewares=traefik-admin-auth
env_file:
- .env
command:
- "--broker=amqp://${RABBITMQ_DEFAULT_USER:-guest}:${RABBITMQ_DEFAULT_PASS:-guest}@rabbitmq:5672//"
depends_on:
- rabbitmq
- redis
networks:
- traefik-public
local_worker:
build:
context: ..
dockerfile: ./setup/devops/docker/app.dockerfile
image: swtools:app
restart: always
volumes:
- ${SWTOOLSWORKINGDIR:-/tmp}:${SWTOOLSWORKINGDIR:-/tmp}
command: ["celery", "--app=app.worker.celery_app:celery_app", "worker", "-n", "local_worker@%h"]
env_file:
- .env
environment:
- RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
- RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
- RABBITMQ_HOST=rabbitmq
- REDIS_HOST=${HOSTNAME?Variable not set}
depends_on:
- rabbitmq
- redis
networks:
- traefik-public
dashboard_app:
image: swtools:app
restart: always
labels:
- traefik.enable=true
- traefik.http.services.dash-app.loadbalancer.server.port=${DASH_PORT-8080}
- traefik.http.routers.dash-app-http.entrypoints=http
- traefik.http.routers.dash-app-http.rule=Host(`dashboard.${HOSTNAME?Variable not set}.example.app`)
- traefik.docker.network=traefik-public
- traefik.http.routers.dash-app-https.entrypoints=https
- traefik.http.routers.dash-app-https.rule=Host(`dashboard.${HOSTNAME?Variable not set}.example.app`)
- traefik.http.routers.dash-app-https.tls=true
- traefik.http.routers.dash-app-https.tls.certresolver=le
- traefik.http.routers.dash-app-http.middlewares=https-redirect
- traefik.http.middlewares.operator-auth.basicauth.users=${OPERATOR_USERNAME?Variable not set}:${HASHED_OPERATOR_PASSWORD?Variable not set}
- traefik.http.routers.dash-app-https.middlewares=operator-auth
volumes:
- ${SWTOOLSWORKINGDIR:-/tmp}:${SWTOOLSWORKINGDIR:-/tmp}
command: ['waitress-serve', '--port=${DASH_PORT:-8080}', 'app.order_dashboard:app.server']
env_file:
- .env
environment:
- RABBITMQ_DEFAULT_USER=${RABBITMQ_DEFAULT_USER}
- RABBITMQ_DEFAULT_PASS=${RABBITMQ_DEFAULT_PASS}
- RABBITMQ_HOST=rabbitmq
- REDIS_HOST=${HOSTNAME?Variable not set}
networks:
- traefik-public
depends_on:
- rabbitmq
- redis
networks:
traefik-public:
external: true
and my celery config file:
import os
import warnings
from pathlib import Path

# result backend uses redis
result_backend_host = os.getenv('REDIS_HOST', 'localhost')
result_backend_pass = os.getenv('REDIS_PASS', 'password')
result_backend = 'redis://:{password}@{host}:6379/0'.format(password=result_backend_pass, host=result_backend_host)
# redis_retry_on_timeout = True
redis_socket_keepalive = True

# broker uses rabbitmq
rabbitmq_user = os.getenv('RABBITMQ_DEFAULT_USER', 'guest')
rabbitmq_pass = os.getenv('RABBITMQ_DEFAULT_PASS', 'guest')
rabbitmq_host = os.getenv('RABBITMQ_HOST', 'localhost')
broker_url = 'amqp://{user}:{password}@{host}:5672//'.format(user=rabbitmq_user, password=rabbitmq_pass, host=rabbitmq_host)

include = ['app.worker.tasks', 'app.dashboard.example1', 'app.dashboard.example2']

# task events
worker_send_task_events = True
task_send_sent_event = True
All the env variables are defined and everything works well, except for my socket timeout problem! When I deploy a new worker on a container instance, I set the env variables so that it connects to the RabbitMQ and Redis instances running in the docker-compose.
Here is my celery file that defines the celery app:
from celery import Celery
from app.worker import celery_config
celery_app = Celery()
celery_app.config_from_object(celery_config)
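For completeness, the dashboard app sends tasks and waits for results roughly like this (a simplified sketch; the task name, path and timeout are placeholders). It is this get() call that waits for 2 minutes or, since my EDIT, raises the TimeoutError:

from app.worker.celery_app import celery_app

# send the task to RabbitMQ, then block on the Redis result backend
result = celery_app.send_task('app.worker.tasks.remove_directory', args=['/some/path'])
print(result.get(timeout=300))  # this is where the 2-minute latency / "Timeout reading from socket" shows up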
EDIT 2:
Finally, changing the backend to rpc resolved the problem. I tried different things with Redis that didn't work. A way to dig further would be to inspect the sockets with tcpdump to see where it blocks, but I didn't try that, as my problem was solved with the rpc backend.
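Concretely, the change in the Celery config file amounts to replacing the Redis result backend with the RPC backend, which sends results back over RabbitMQ. Roughly:

# broker stays on RabbitMQ, unchanged
broker_url = 'amqp://{user}:{password}@{host}:5672//'.format(user=rabbitmq_user, password=rabbitmq_pass, host=rabbitmq_host)

# result backend: rpc instead of redis (results come back as AMQP messages)
result_backend = 'rpc://'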