How to keep a long-running Flink job alive across Kerberos password changes?

Discussions of long-running Flink (or Spark) jobs tend to omit how to avoid failures when passwords are rolled periodically.

A password roll will invalidate any keytabs being used by the application, and the job will then fail once the current session's ticket expires, which might be 24 hours after the password change.

I don't see anything in Flink at the moment that supports continuous running across a password roll.

The app will fail and have to be rescheduled from scratch.

Is there any prior art for avoiding this failure?

For example, is there a feature that will allow us to periodically refresh the keytab? Is anyone doing that?

Solution

Flink currently does not support any kind of keytab refresh, and there has been little to no interest in this feature so far. A keytab is considered very long-lived and rarely changing. It's similar to a role in AWS that is used to fetch the password of a DB: the password is rotated often, but the role almost never changes.

Thus, there is currently no way around letting the whole application fail and rescheduling it. If the keytab changes often, I'd automate the restart and proactively reschedule during off-hours. For more specific hints, you could add whether you are running Flink on YARN or K8s.
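As a rough illustration of the proactive rescheduling idea, here is a minimal sketch of how the restart time could be picked. The class and method names (RestartPlanner, planRestart) are made up for this example, and it assumes you already know when the current ticket expires and when your off-hours window starts. The restart itself (for example, stopping the job with a savepoint and resubmitting it with the new keytab) is left out.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;

/** Hypothetical helper, not part of Flink: picks a restart time after a keytab change. */
class RestartPlanner {

    /**
     * Returns the next daily off-hours start after 'now' if it falls before
     * the ticket expiry; otherwise returns 'now', meaning "restart right away".
     */
    static LocalDateTime planRestart(LocalDateTime now,
                                     LocalDateTime ticketExpiry,
                                     LocalTime offHoursStart) {
        // Today's off-hours start, or tomorrow's if today's has already passed.
        LocalDateTime candidate = now.toLocalDate().atTime(offHoursStart);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1);
        }
        // Waiting is only safe if the window opens before the ticket expires.
        return candidate.isBefore(ticketExpiry) ? candidate : now;
    }
}
```

A scheduler (cron, or whatever your platform offers) could call this once a keytab change is detected and trigger the restart at the returned time.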

If restarting is not an option because of state size and SLAs, you could try writing your own SecurityModule that refreshes the keytab. Alternatively, you could file a feature request on Jira. Features with many votes tend to get implemented sooner.
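To give an idea of what the refreshing part of such a module might look like, here is a minimal sketch that uses only the JDK. KeytabRefresher is a hypothetical name, not a Flink class, and the sketch assumes the new keytab is written to the same path as the old one. It only detects the change and calls a callback; the callback is where the actual re-login would go, for example Hadoop's UserGroupInformation.loginUserFromKeytab(principal, keytabPath). Whether every component in the job picks up the new credentials after that is something you would have to verify yourself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.function.Consumer;

/** Hypothetical helper, not part of Flink: re-logs in when the keytab file is replaced. */
class KeytabRefresher {
    private final Path keytab;
    private final Consumer<Path> relogin; // assumption: performs the actual Kerberos re-login
    private FileTime lastSeen;

    KeytabRefresher(Path keytab, Consumer<Path> relogin) throws IOException {
        this.keytab = keytab;
        this.relogin = relogin;
        this.lastSeen = Files.getLastModifiedTime(keytab);
    }

    /** Returns true if the keytab changed since the last check and the callback was invoked. */
    synchronized boolean checkOnce() throws IOException {
        FileTime current = Files.getLastModifiedTime(keytab);
        if (current.equals(lastSeen)) {
            return false; // keytab untouched, nothing to do
        }
        relogin.accept(keytab);
        lastSeen = current; // only remember the new timestamp after a successful re-login
        return true;
    }
}
```

In a custom SecurityModule, install() could start a ScheduledExecutorService that calls checkOnce() at a fixed interval, and uninstall() could shut it down again.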