I have a simple Dataflow pipeline configured with 2-second fixed windows. It reads from Pub/Sub, deserializes each message into an object, logs the object, groups the objects by a given key, then does some processing on the grouped objects. Keep in mind this is a Dataflow job running on a single VM (so there shouldn't even be a cross-machine shuffle), and it only sees a few (2-3) messages/sec.
| "read_messages" >> io.ReadFromPubSub(subscription=input_topic_subscription_path)
| "window" >> WindowInto(FixedWindows(2))
| "deser_obj" >> ParDo(DeSerializeToObject())
| "key_by_id" >> WithKeys(lambda obj: obj.id)
| "group_by_player_id" >> GroupByKey()
| "process_group" >> ParDo(ProcessGroup())
The problem I'm seeing is that the GroupByKey takes about 36 seconds pretty consistently: the logs from the process_group stage are always 25-40 seconds behind the logs from the deser_obj stage, and the pipeline's data freshness sits around 36 seconds as a result. That doesn't make sense to me: a streaming pipeline that processes in 2-second micro-batches, yet takes 36 seconds to complete each micro-batch.
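One way to put numbers on this lag is to log how far each element's timestamp trails wall-clock time in both stages (a minimal sketch; LogFreshness is a hypothetical helper, and beam.DoFn.TimestampParam is the element's Pub/Sub-assigned event timestamp):

import logging
import time

import apache_beam as beam

class LogFreshness(beam.DoFn):
    """Logs how far behind wall-clock time each element's timestamp is."""
    def __init__(self, stage):
        self.stage = stage

    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        lag = time.time() - timestamp.micros / 1e6
        logging.info("%s: element is %.1fs behind wall clock", self.stage, lag)
        yield element

# e.g. insert after the deser_obj and process_group steps:
#   | "freshness_deser" >> beam.ParDo(LogFreshness("deser_obj"))
#   | "freshness_group" >> beam.ParDo(LogFreshness("process_group"))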
I even followed this exact GCP tutorial and got the same result (i.e., each window takes ~36 seconds and GroupByKey is the bottleneck): https://cloud.google.com/pubsub/docs/pubsub-dataflow#python
Is Dataflow not designed to do streaming at small time intervals like 2-second micro-batches? Based on the numbers I'm seeing, I'd only recommend Dataflow streaming if your streaming needs have an SLA > 40 seconds.
It is correct that Dataflow is not (yet) tuned to handle sub-10-second latencies, though 36 seconds does seem a bit on the slow end. FYI, ReadFromPubSub does introduce a shuffle to ensure exactly-once processing (i.e., no duplicates), and the GroupByKey does a full distributed shuffle as well, even though there is only a single VM.
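One thing that may be worth experimenting with (a sketch, not a guaranteed fix) is adding an early processing-time trigger to the window, so panes can fire before the watermark passes the window's end:

from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime, Repeatedly

# Replace the plain "window" step with one that fires early:
| "window" >> WindowInto(
    FixedWindows(2),
    # Fire ~1 second after the first element arrives in each window,
    # instead of waiting for the watermark to pass the window's end.
    trigger=Repeatedly(AfterProcessingTime(1)),
    accumulation_mode=AccumulationMode.DISCARDING,
)

Note this changes the output semantics: each window may emit multiple partial panes, and with DISCARDING mode each pane contains only the elements that arrived since the previous firing, so it is only appropriate if your downstream processing can tolerate partial groups.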