hadoop · hdfs · apache-flink · flink-streaming

How to set HDFS as statebackend for flink


I want to store Flink state in HDFS so that after a crash I can recover the state from HDFS. I am planning to write state to HDFS every 60 seconds. How can I achieve this? Is this the config I need to follow? https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/state_backends.html#setting-default-state-backend

And where do I specify the checkpoint interval? Any link or sample code would be helpful.


Solution

  • Choosing where checkpoints are stored (e.g., HDFS) is separate from deciding which state backend to use for managing your working state (which can be on-heap, or in local files managed by the RocksDB library).

    These two concepts were cleanly separated in Flink 1.12. In earlier versions of Flink, the two appeared to be more strongly related than they actually are, because the filesystem and RocksDB state backend constructors took a file URI as a parameter specifying where checkpoints should be stored.
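    If you prefer to set this up programmatically rather than in the config file, the same choices can be made on the `StreamExecutionEnvironment`. This is a minimal sketch assuming Flink 1.13+ (where `HashMapStateBackend` is the heap-based backend); the namenode address and job name are placeholders:

    ```java
    import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Working state is kept on the JVM heap; swap in
            // EmbeddedRocksDBStateBackend for state larger than memory.
            env.setStateBackend(new HashMapStateBackend());

            // Checkpoint every 60 seconds, as asked in the question.
            env.enableCheckpointing(60_000);

            // Durable checkpoint storage on HDFS (placeholder address).
            env.getCheckpointConfig()
               .setCheckpointStorage("hdfs://namenode-host:port/flink-checkpoints");

            // ... define sources, transformations, and sinks here ...

            env.execute("checkpointed-job");
        }
    }
    ```

    Note that `setCheckpointStorage` (checkpoint durability) and `setStateBackend` (where working state lives) are now independent calls, which is exactly the separation described above.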

    The best way to manage all of this is to leave this out of your code, and to specify the configuration you want in flink-conf.yaml, e.g.,

    state.backend: filesystem
    state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
    execution.checkpointing.interval: 10s
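
    Since the separation in Flink 1.12/1.13, the equivalent configuration uses the newer backend name, where `state.backend` describes only how working state is held (the HDFS address remains a placeholder, and the interval here matches the 60 seconds asked about in the question):

    state.backend: hashmap
    state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
    execution.checkpointing.interval: 60s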