Problem:
During an upgrade, a pod that needs to be evicted off a node might take longer to terminate than the node drain timeout, which yields the following error:
(UpgradeFailed) Drain of NODE_NAME did not complete pods [STS_NAME:POD_NAME]: Pod
POD_NAME still in state Running on node NODE_NAME, pod termination grace period 15h0m0s was
greater than remaining per node drain timeout. See http://aka.ms/aks/debugdrainfailures
Code: UpgradeFailed
After this, the cluster is in a failed state.
Since the grace period of these pods is not under my control, I would like to increase the node drain timeout to 31 hours, as there can be two of these long-grace-period pods on a single node. I haven't been able to find anything about the node drain timeout, though. I can't even figure out whether it's part of Kubernetes or specific to AKS.
How do I increase the per-node drain timeout so that my long-grace-period pods don't interrupt my node upgrade operations?
EDIT: In the kubectl CLI reference, the drain command takes a timeout parameter. As I don't invoke the drain myself, I don't see how this helps me. It led me to believe that, if anywhere, this needs to be dealt with on the AKS side of things. (An illustrative manual drain is shown below for comparison.)
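For reference, this is what the timeout looks like on a manual drain; it only applies when you run the drain yourself, not to the drain AKS performs during an upgrade. The node name is a placeholder:

# manual drain with an explicit overall timeout (illustrative only)
kubectl drain aks-nodepool1-12345678-vmss000000 --ignore-daemonsets --delete-emptydir-data --timeout=31h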
Drain timeout is configurable in the August API of AKS. Documentation is still pending.
A link to the August API Update that provides drainTimeoutInMinutes:
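As a rough sketch of how that property is set, it lives under the agent pool's upgradeSettings in the managed cluster API. The subscription, resource group, cluster, and node pool names below are placeholders, the api-version is assumed from the "August API" mentioned above, and 1860 minutes corresponds to the 31 hours asked for in the question; verify the exact API version and allowed value range against the linked update:

# assumed api-version; check the August API update for the exact value
PUT https://management.azure.com/subscriptions/SUB_ID/resourceGroups/RG_NAME/providers/Microsoft.ContainerService/managedClusters/CLUSTER_NAME/agentPools/NODEPOOL_NAME?api-version=2023-08-02-preview

{
  "properties": {
    "upgradeSettings": {
      "drainTimeoutInMinutes": 1860
    }
  }
}

Newer Azure CLI versions also expose this setting, if yours supports the flag:

az aks nodepool update --resource-group RG_NAME --cluster-name CLUSTER_NAME --name NODEPOOL_NAME --drain-timeout 1860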