We are encountering an issue with Google Kubernetes Engine (GKE) where periodic upgrades to newer versions are causing disruptions to our pods and containers within the cluster. While we understand that upgrades are necessary and expected, the problem is that the cluster interrupts our services while requests are still in flight.
Our setup includes several services, notably an Express Gateway and multiple Rails services interconnected as follows: Ingress -> Express -> Rails1 -> Rails2.
During a GKE upgrade, if a request is in flight from Express to Rails1 when Rails1 is terminated by the upgrade, only a generic timeout is received at the gateway, with no detailed error or indication of the underlying cause:
RequestError: Timeout awaiting 'request' for 3000ms
at ClientRequest.<anonymous> (/app/node_modules/got/dist/source/core/index.js:970:65)
at /app/node_modules/@opentelemetry/context-async-hooks/build/src/AbstractAsyncHooksContextManager.js:50:55
at AsyncLocalStorage.run (node:async_hooks:319:14)
at AsyncLocalStorageContextManager.with (/app/node_modules/@opentelemetry/context-async-hooks/build/src/AsyncLocalStorageContextManager.js:33:40)
at ClientRequest.contextWrapper (/app/node_modules/@opentelemetry/context-async-hooks/build/src/AbstractAsyncHooksContextManager.js:50:32)
at Object.onceWrapper (node:events:628:26)
at ClientRequest.emit (node:events:525:35)
at ClientRequest.origin.emit (/app/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)
at TLSSocket.socketErrorListener (node:_http_client:494:9)
at TLSSocket.emit (node:events:513:28)
at emitErrorNT (node:internal/streams/destroy:157:8)
at emitErrorCloseNT (node:internal/streams/destroy:122:3)
at processTicksAndRejections (node:internal/process/task_queues:83:21)
at Timeout.timeoutHandler [as _onTimeout] (/app/node_modules/got/dist/source/core/utils/timed-out.js:36:25)
at listOnTimeout (node:internal/timers:561:11)
at processTimers (node:internal/timers:502:7) {
We have already tried to schedule these upgrades outside of our business hours, but that does not solve the underlying problem. I have also looked through the logs but could not find much more information. If you need any other logs, I will try to find them and post them here.
You may wish to look into lifecycle hooks (to drain connections during termination) and PodDisruptionBudgets (to keep a minimum number of replicas available during voluntary disruptions such as node upgrades) to help mitigate these issues; see the sketch below.
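As a minimal sketch of both ideas, assuming a Deployment named rails1 with the label app: rails1 and a container port of 3000 (the names, image, port, and sleep duration are placeholders, not taken from your setup):

```yaml
# Placeholder Deployment showing a preStop hook that delays SIGTERM so
# in-flight requests can finish after the endpoint is removed from the Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rails1                  # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rails1
  template:
    metadata:
      labels:
        app: rails1
    spec:
      terminationGracePeriodSeconds: 60   # must exceed the preStop sleep plus drain time
      containers:
        - name: rails1
          image: your-registry/rails1:latest   # placeholder image
          ports:
            - containerPort: 3000
          lifecycle:
            preStop:
              exec:
                # Wait before the container receives SIGTERM, giving the
                # load balancer time to stop routing new requests here.
                command: ["sh", "-c", "sleep 15"]
---
# PodDisruptionBudget: a node drain during the upgrade will wait until
# evicting a pod still leaves at least one rails1 replica available.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rails1-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rails1
```

The same pattern would apply to the Express gateway and Rails2. Note that the PodDisruptionBudget only helps if the Deployment runs more than one replica, and the applications should still handle SIGTERM by finishing in-flight requests before exiting.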