scala, hashmap, apache-flink, flink-streaming

Can Flink handle ~50 GB of state for a single table/window?


I am building a streaming analytics application that requires ~50 GB of initial state in memory for a single table. ~50 GB is the amount of RAM used when I load the state into a Scala HashMap[String,String].

Can Flink handle having ~50 GB of state for a single table that grows over time?

Will I be able to perform lookups and updates to this table in a streaming fashion?

Notes:

  • I cannot change the types to anything smaller.
  • The state is used as a lookup for mapping one String to another String.
  • It would take roughly three years for the state to double to 100 GB (an aggressive estimate, since the current state took ten years to accumulate).
  • This Flink blog post mentions terabytes of state and claims that state size should not be a problem, but I thought I would double-check before spinning it up.

Solution

  • 50-100 GB for a single table in Flink state is not a problem.

    But to be clear, when we talk about having huge amounts of state in Flink (e.g., terabytes), we are talking about keyed state that is sharded across many parallel tasks. Yes, you can have a single table that is very large, but any given instance will only hold a subset of the rows of that table. A minimal sketch of such a keyed lookup is shown below.

    Note that you will need to choose a state backend: either a heap-based state backend, which keeps the state in memory as objects on the JVM heap, or the RocksDB state backend, which keeps the state as serialized bytes on disk with an in-memory cache. The second sketch below shows how a backend can be configured.
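
    Here is a rough sketch of what the keyed String-to-String lookup could look like. The Event case class and its field names are invented for illustration; the Flink APIs (KeyedProcessFunction, ValueState) are real, but treat this as a starting point rather than a drop-in implementation:

        import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
        import org.apache.flink.configuration.Configuration
        import org.apache.flink.streaming.api.functions.KeyedProcessFunction
        import org.apache.flink.util.Collector

        // Hypothetical event: the lookup key, plus an optional new value to store.
        case class Event(key: String, newValue: Option[String])

        class LookupFunction extends KeyedProcessFunction[String, Event, (String, String)] {

          // One String of state per key; Flink shards these entries across the
          // parallel instances of the operator, so no single task holds the full table.
          @transient private var mapping: ValueState[String] = _

          override def open(parameters: Configuration): Unit = {
            mapping = getRuntimeContext.getState(
              new ValueStateDescriptor[String]("mapping", classOf[String]))
          }

          override def processElement(
              event: Event,
              ctx: KeyedProcessFunction[String, Event, (String, String)]#Context,
              out: Collector[(String, String)]): Unit = {
            // Apply the update for this key, if any...
            event.newValue.foreach(mapping.update)
            // ...then emit the current mapping as the lookup result.
            val current = mapping.value()
            if (current != null) out.collect((event.key, current))
          }
        }

    Wired up with events.keyBy(_.key).process(new LookupFunction), lookups and updates both happen in a streaming fashion, one event at a time.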
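
    And a sketch of selecting the RocksDB backend programmatically, assuming Flink 1.13 or later (in older versions the class was RocksDBStateBackend; the checkpoint path here is a placeholder):

        import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
        import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // RocksDB keeps state as serialized bytes on local disk with an in-memory
        // cache, so total state size is bounded by disk rather than JVM heap.
        env.setStateBackend(new EmbeddedRocksDBStateBackend())

        // Durable checkpoint storage (placeholder path).
        env.getCheckpointConfig.setCheckpointStorage("hdfs:///flink/checkpoints")
        env.enableCheckpointing(60000) // checkpoint every 60 s

    The same choice can also be made in flink-conf.yaml with state.backend: rocksdb, which avoids hard-coding the backend in the job.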