I am building a streaming analytics application that requires ~50 GB of initial state in memory for a single table. ~50 GB is the amount of RAM used when I load the state into a Scala HashMap[String, String].
Can Flink handle having ~50 GB of state for a single table that grows over time?
Will I be able to perform lookups and updates to this table in a streaming fashion?
Notes:
50-100 GB for a single table in Flink state is not a problem.
But to be clear, when we talk about having huge amounts of state in Flink (e.g., terabytes) we are talking about keyed state that is sharded across many parallel tasks. Yes, you can have a single table that is very large, but any given instance will only have a subset of the rows of that table.
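The sharding described above can be sketched in plain Java. This is a conceptual model only, not Flink code: Flink actually hashes each key into a "key group" that is assigned to a parallel subtask, but the effect is the same as the simple hash-modulo partitioning below, where no single instance ever holds the whole table. The class and method names here are illustrative, not part of any Flink API.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of keyed-state sharding: each parallel instance
// ("shard") holds only the rows whose keys hash to it.
public class ShardingSketch {
    static final int PARALLELISM = 4;

    @SuppressWarnings("unchecked")
    static final Map<String, String>[] SHARDS = new Map[PARALLELISM];
    static {
        for (int i = 0; i < PARALLELISM; i++) {
            SHARDS[i] = new HashMap<>();
        }
    }

    // Pick the instance responsible for a key (Flink uses key groups,
    // but the idea is the same: a deterministic hash of the key).
    static int shardFor(String key) {
        return Math.abs(key.hashCode() % PARALLELISM);
    }

    static void put(String key, String value) {
        SHARDS[shardFor(key)].put(key, value);
    }

    static String get(String key) {
        return SHARDS[shardFor(key)].get(key);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 1000; i++) {
            put("row-" + i, "value-" + i);
        }
        // The full "table" is the union of the shards; each instance
        // holds only a fraction of the 1000 rows.
        for (int i = 0; i < PARALLELISM; i++) {
            System.out.println("shard " + i + ": " + SHARDS[i].size() + " rows");
        }
    }
}
```

Lookups and updates stay streaming-friendly because every key deterministically maps to one instance, so each event is routed to the shard that owns its slice of the table.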
Note that you will need to choose a state backend: either a heap-based state backend, which keeps the state in memory as objects on the JVM heap, or the RocksDB state backend, which keeps the state as serialized bytes on disk with an in-memory cache.
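For reference, the backend can be selected in the Flink configuration; a minimal sketch (the checkpoint path here is a placeholder you would replace with your own storage location):

```
# flink-conf.yaml
# "hashmap" keeps state as objects on the JVM heap;
# "rocksdb" keeps state as serialized bytes on disk with an in-memory cache.
state.backend: rocksdb
state.checkpoints.dir: file:///tmp/flink-checkpoints   # placeholder path
```

With 50-100 GB per table, RocksDB is usually the safer choice, since heap-based state of that size puts significant pressure on the garbage collector.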