I kind-of "inherited" a project that uses Airflow 2.2.4 installed on a cluster of several nodes (meaning that I wasn't part of the deployment decisions and configurations and I might not be aware of some under-the-hood processes). Each node runs a scheduler, a CeleryExecutor and a webserver. Task logging is done locally on the nodes' file system. However there must be some misconfiguration somewhere and I can't figure it out. Here is what I have observed:
- The first try of a task runs on node A: `1.log` is written in the log folder on that same node A, and the log is visible in the web UI - so far so good.
- A later try of the same task runs on node B: `2.log` is written in the log folder on node B, and this last log is also visible in the UI.
- The problem is that the UI now tries to fetch `1.log` from node B rather than node A (I checked that `1.log` effectively exists on node A).

Example of the UI error message:
*** Log file does not exist: [install_path]/airflow/logs/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/1.log
*** Fetching from: http://nodeb.mycompany.com:19793/log/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/1.log
*** Failed to fetch log file from worker. Client error '404 NOT FOUND' for url 'http://nodeb.mycompany.com:19793/log/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/1.log'
For more information check: https://httpstatuses.com/404
Example of a correct log-fetching message:
*** Log file does not exist: [install_path]/airflow/logs/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/2.log
*** Fetching from: http://nodeb.mycompany.com:19793/log/start_acquisition/run_writegofile/2022-07-18T01:00:00+00:00/2.log
Sorry, I had to mask out some sensitive info. I'm more than happy to provide more details about the configuration or anything else; I'm not sure what would be useful here.
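For reference, these are the settings that, as far as I understand, control where task logs are written and how the webserver fetches them from the workers. The values below are illustrative placeholders, not my actual configuration:

```ini
# airflow.cfg (placeholder values, for illustration only)
[logging]
# each worker writes task logs under this local path
base_log_folder = [install_path]/airflow/logs
# remote logging is off, so logs stay on the node that ran the task
remote_logging = False
# port of the per-worker serve_logs process that the webserver fetches logs from
# (in 2.2 this option lives under [logging]; older versions had it under [celery])
worker_log_server_port = 19793

[core]
# how each worker reports its hostname on the task instance record;
# the webserver later uses that hostname to build the fetch URL
hostname_callable = socket.getfqdn
```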
It turns out this is a known issue with cluster deployments that store task logs locally: https://github.com/apache/airflow/pull/23178
Unfortunately, it doesn't seem that anyone is actively working on merging that PR.
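For anyone hitting the same thing: the usual ways to stop depending on node-local task logs in a multi-node deployment seem to be either a shared log volume (e.g. an NFS mount at base_log_folder on every node) or remote logging. A minimal remote-logging sketch, assuming an S3 bucket, an existing Airflow connection and the apache-airflow-providers-amazon package installed on all nodes (the bucket name and connection id are placeholders):

```ini
# airflow.cfg - remote task logging sketch (placeholder values), applied on every node
[logging]
remote_logging = True
# id of an existing Airflow connection with write access to the bucket
remote_log_conn_id = my_s3_conn
# workers write (and the webserver reads) task logs here instead of the local disk
remote_base_log_folder = s3://my-airflow-logs/logs
```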