I am new to the Dataflow programming model and have a few questions about how Dataflow stores intermediate state in a windowed streaming job. Let's say I am windowing by day and then performing an aggregation. When a new event comes in, the aggregation needs access to all of the data already in that window and key group.
Is this data stored in memory, on disk, in GCS, or somewhere completely different?
Does the volume of intermediate data affect the number of machines necessary for a job?
What happens to the data when the window is closed?
If I am performing an operation such as summing which does not require all of the data to be stored in an intermediate state, is there a way to tell dataflow to only store the results of the last update?
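For concreteness, I mean something roughly like the following pipeline (a minimal sketch assuming the Apache Beam/Dataflow Java SDK; the in-memory input and key names are just placeholders for a real unbounded source):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class DailySumExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder input; a real streaming job would read from an unbounded
        // source such as Pub/Sub.
        PCollection<KV<String, Long>> events =
            p.apply(Create.of(KV.of("user-a", 3L), KV.of("user-b", 5L)));

        PCollection<KV<String, Long>> dailyTotals =
            events
                // Assign each element to a one-day fixed window based on its timestamp.
                .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardDays(1))))
                // Aggregate per key within each window.
                .apply(Sum.longsPerKey());

        p.run();
      }
    }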
In the current implementation of Dataflow, this is stored on persistent disk (to protect against machine failures) and opportunistically cached in memory.
The number of machines affects cache performance and the number of disk IOPS available, and thus per-machine throughput. Intermediate data may also need to be shuffled between machines, which increases CPU needs.
Two things can happen to a window: it fires when its trigger fires, and it closes (and is garbage-collected) when the watermark passes the end of the window plus the allowed lateness.
When a window fires, the behavior depends on the accumulation mode in use. With .accumulatingFiredPanes(), the data is kept until the allowed lateness set by .withAllowedLateness() has passed. With .discardingFiredPanes(), the data is discarded after each firing.
When a window closes, all remaining data (either the delta or the total value, depending on the accumulation mode) is emitted to the next transform, and all state for the window is cleared. If you are using the default trigger with zero allowed lateness (also the default), firing and closing happen at the same time.
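To show how these pieces fit together, here is a sketch of a windowing configuration in the Java SDK (the particular trigger, window size, and durations are just illustrative assumptions):

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WindowingExample {
      // Applies one-day windows with early firings, one hour of allowed lateness,
      // and accumulating panes to an existing keyed PCollection.
      static PCollection<KV<String, Long>> windowDaily(PCollection<KV<String, Long>> keyedEvents) {
        return keyedEvents.apply(
            Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardDays(1)))
                // Fire an early pane one minute after the first element in a pane,
                // then a final on-time pane when the watermark passes the end of the window.
                .triggering(
                    AfterWatermark.pastEndOfWindow()
                        .withEarlyFirings(
                            AfterProcessingTime.pastFirstElementInPane()
                                .plusDelayOf(Duration.standardMinutes(1))))
                // Window state is kept (and late data accepted) until one hour after
                // the watermark passes the end of the window; then the window closes
                // and its state is garbage-collected.
                .withAllowedLateness(Duration.standardHours(1))
                // Each firing re-emits the full accumulated value so far; use
                // .discardingFiredPanes() to emit only the delta since the last firing.
                .accumulatingFiredPanes());
      }
    }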
Yes! If you express the aggregation as an associative, commutative combining operation (a Combine transform backed by a CombineFn, often called a combiner), then only a compact intermediate accumulator (for summing, just the running total) is stored rather than all of the input elements.
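For example, summing can be written with one of the built-in Combine transforms (a sketch, again assuming the Java SDK, where windowedEvents is the windowed, keyed PCollection from earlier):

    // Built-in combiner: the intermediate state per key and window is just the
    // running sum, not the full list of input elements.
    PCollection<KV<String, Long>> totals = windowedEvents.apply(Sum.longsPerKey());

    // The same idea works for any associative, commutative CombineFn:
    PCollection<KV<String, Long>> totals2 =
        windowedEvents.apply(Combine.perKey(Sum.ofLongs()));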