high-availability.storageDir maintenance in flink-conf.yaml

With just this HA setting, Flink keeps producing empty subdirectories in the directory that high-availability.storageDir points to. All of them seem to be deletable, except for the default/blob subdirectory, which seems to be a placeholder for checkpoints? Left without any maintenance, this overproduction eventually fills up the disk, hits inode limits, etc. What is the intended way of deleting or compacting high-availability.storageDir (set to /opt/flink/ha/ in our setup): just delete everything outside of default/blob that is older than some age, or something else? Is there an HA setting in flink-conf.yaml that enables some kind of rotation, so that no such maintenance is needed?

We already had an incident where the job manager refused to start because disk space was exhausted: a checkpoint could not be written, yet it was expected on startup because of the information stored in ZooKeeper, so we had to delete that information.

Other settings related to HA are:

high-availability: zookeeper
high-availability.storageDir: /opt/flink/ha/
high-availability.zookeeper.quorum: zoo-keeper-1.flink.svc:2181,zoo-keeper-2.flink.svc:2181,zoo-keeper-3.flink.svc:2181
high-availability.jobmanager.port: 6123

Solution

Sounds a bit as if you are running into https://issues.apache.org/jira/browse/FLINK-11107, which has recently been fixed in Flink 1.8.1.
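
If upgrading is not possible right away, a periodic cleanup along the lines the question suggests could serve as a stopgap. The following is only a sketch, not something Flink ships: the function name remove_stale_empty_dirs, its parameters, and the age threshold are made up for illustration, and it assumes the leftover directories sit directly under the storage directory and are empty, as described in the question. It only ever calls rmdir, which fails on non-empty directories, so anything that still holds data is left alone, and it skips the default directory entirely.

```python
import time
from pathlib import Path


def remove_stale_empty_dirs(storage_dir, max_age_seconds,
                            protected=("default",), now=None):
    """Remove empty top-level subdirectories older than max_age_seconds.

    Hypothetical helper, not part of Flink. Returns the names it removed.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in sorted(Path(storage_dir).iterdir()):
        # Skip plain files, symlinks, and protected names such as "default".
        if entry.is_symlink() or not entry.is_dir() or entry.name in protected:
            continue
        # Skip directories that were modified too recently.
        if now - entry.stat().st_mtime < max_age_seconds:
            continue
        try:
            entry.rmdir()  # succeeds only if the directory is empty
        except OSError:
            continue  # non-empty or otherwise not removable: leave it alone
        removed.append(entry.name)
    return removed
```

Run from cron, this could be called as remove_stale_empty_dirs("/opt/flink/ha/", 7 * 24 * 3600); the one-week threshold is an arbitrary choice, and it is worth trying the script on a copy of the directory first.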

Hope this helps.

Konstantin