I am using OpenShift to deploy decoupled units of Apache Airflow (version 2.8.3): the scheduler and the webserver each run in a separate deployment. For the metadata database I am using the official MySQL 8.3 image, also in a separate deployment.
The issue appears when I switch to a remote database running on a VM instead of in a container. The connection is established without problems, and nothing goes wrong until one of the DAGs starts to run or is triggered manually. Then the scheduler log stops, and this hang causes the web UI to report that the scheduler is not sending a heartbeat.
All components (scheduler and webserver config, DAGs, DB version, and even the data inside) were tested under the same conditions with the DB inside OpenShift, and no problem occurs there.
I really appreciate any help you can provide.
I have also tested a Podman MySQL container of the same version running on a remote VM, first with a published port and then with host networking to make the connection direct. Neither solved the issue (there is no problem accessing any component).
Solved. The issue was in the subnet configuration: once I hosted the database on another VM in a different subnet, everything worked and the scheduler never went down again.
The takeaway is to check the database query logs first. In my case, the scheduler keeps updating the "latest_heartbeat" field in the job table so that the web server knows it is healthy. Then the updates suddenly stopped, so the scheduler was flagged as unhealthy. After that, other queries from the same scheduler still reached the database, but not the heartbeat update. That told me it was not a reachability issue but something in between, which is where the second part comes in: check the network path between the two sides.
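For anyone who wants to watch the heartbeat directly instead of reading raw query logs, here is a minimal Python sketch of the same check. It is not taken from Airflow's code. It assumes the scheduler row in the job table has job_type 'SchedulerJob', that latest_heartbeat is stored in UTC, and that a 30-second threshold is appropriate (adjust it to your scheduler_health_check_threshold setting). The function name is_heartbeat_stale is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Query to run by hand (or with any MySQL client) against the Airflow
# metadata database. Assumption: the scheduler registers itself in the
# "job" table with job_type = 'SchedulerJob'.
HEARTBEAT_QUERY = """
SELECT id, hostname, state, latest_heartbeat
FROM job
WHERE job_type = 'SchedulerJob'
ORDER BY latest_heartbeat DESC
LIMIT 1
"""


def is_heartbeat_stale(latest_heartbeat, now=None, threshold_seconds=30):
    """Return True if the last heartbeat is older than the threshold.

    latest_heartbeat: datetime read from job.latest_heartbeat. A naive
    value is treated as UTC (assumption about how the DB stores it).
    threshold_seconds: assumed threshold, not read from Airflow config.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if latest_heartbeat.tzinfo is None:
        latest_heartbeat = latest_heartbeat.replace(tzinfo=timezone.utc)
    age = now - latest_heartbeat
    return age > timedelta(seconds=threshold_seconds)


if __name__ == "__main__":
    # Illustrative values only, not real data from my setup.
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    last_seen = datetime(2024, 1, 1, 11, 59, 15)  # naive, treated as UTC
    print("stale" if is_heartbeat_stale(last_seen, now) else "healthy")
```

If this reports a stale heartbeat while other queries from the same scheduler still show up in the database logs, you are probably in the same situation I was: the database is reachable, so look at the network path in between.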
Thanks and best wishes