python, google-cloud-platform, google-cloud-dataflow, dataflow

How do I set a Dataflow window that will continually re-trigger for more data after all records have been written to BigQuery?


We have a streaming pipeline reading from Pub/Sub and writing to BigQuery. It wasn't working without adding a window function, because the default global window only fires once and never re-triggers. There is no GroupBy or Combine in the pipeline.

We tried to add a Beam window with a trigger, but there are some problems. If we use a global window, it runs really slowly and sometimes throws null pointer exceptions. If we use a fixed window, it's fast, but it sometimes doesn't seem to acknowledge the Pub/Sub messages.

What we really want is a pipeline that reads from Pub/Sub, gets a batch of however many messages it can, writes them to BigQuery, and, once everything is written and the Pub/Sub messages are acknowledged, re-triggers the read from Pub/Sub. Is this possible?


Solution

  • I think you are looking for this. You can use the composite trigger Repeatedly (Repeatedly.forever in the Java SDK) and combine it with AfterCount.

    Something like this, which fires every time 1000 elements have been read:

    Repeatedly(AfterCount(1000))
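
    For reference, here is a minimal sketch in the Python SDK showing where that trigger goes. It assumes the Pub/Sub messages are JSON-encoded rows; the subscription, table, and parse step are placeholders.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from apache_beam.transforms import trigger, window

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-sub")
            | "Parse" >> beam.Map(json.loads)
            # Stay in the single global window, but re-fire every time 1000 new
            # elements have arrived; discard already-fired elements so each
            # batch is written to BigQuery only once.
            | "Window" >> beam.WindowInto(
                window.GlobalWindows(),
                trigger=trigger.Repeatedly(trigger.AfterCount(1000)),
                accumulation_mode=trigger.AccumulationMode.DISCARDING)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.my_table",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )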