Tags: google-cloud-platform, cloud, persistence, google-cloud-dataflow, google-cloud-bigtable

Dataflow - State persistence database


We are considering using Beam/Dataflow for stateful processing:

  • Real-time aggregation of metrics on global windows (results emitted every 1 min)
  • Real-time aggregation over a high number of parallel sessions (> 1 million)

Example: get, for each of 1 million clients, the maximum price of an article bought since the client registered on the portal (see the sketch below).
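For illustration, here is a rough sketch of what we have in mind, in Beam's Java SDK. A `PCollection<Purchase> purchases` with `getClientId()` and `getPrice()` is an assumed input; strictly speaking this computes the max since pipeline start, so pre-registration history would need to be bootstrapped separately:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

// Max purchase price per client: a global window whose accumulating pane
// re-fires every minute, combined with a per-key Max.
PCollection<KV<String, Double>> maxPricePerClient = purchases
    .apply("KeyByClient", MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.doubles()))
        .via(p -> KV.of(p.getClientId(), p.getPrice())))
    .apply("GlobalWindowEveryMinute", Window
        .<KV<String, Double>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))
        .accumulatingFiredPanes()
        .withAllowedLateness(Duration.ZERO))
    .apply("MaxPricePerClient", Max.doublesPerKey());
```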

We would also like to access those calculated aggregates without interfering with the real-time job.

Design question: can this be covered by the current state back-end, Windmill/Persistent Disks [1], or would a database such as Bigtable be a better fit?

Thanks!

[1] Dataflow - State persistence?


Solution

  • It is actually possible to define Bigtable connectors in Dataflow to perform read and write operations, so the pipeline can persist its aggregates to Bigtable directly. Moreover, there is the projects.jobs.get method of the Dataflow API, which returns an instance of a job as a JSON response that also contains the "currentState" field. You could therefore build a sort of automation script that fetches this field's value and stores it in a Bigtable database; however, it is a rather complex solution and I am not sure it would be convenient. Both parts are sketched below.
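For the connector part, a minimal sketch (Beam Java SDK) that writes each per-client aggregate from the pipeline above to Bigtable via BigtableIO; the project, instance, table, and column-family names are placeholders:

```java
import java.util.Collections;

import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptor;

// Convert each (clientId, maxPrice) pair into a single SetCell mutation
// and write it with the Beam Bigtable connector.
maxPricePerClient
    .apply("ToBigtableMutations", MapElements
        .into(new TypeDescriptor<KV<ByteString, Iterable<Mutation>>>() {})
        .via(kv -> KV.of(
            ByteString.copyFromUtf8(kv.getKey()),           // row key = client id
            Collections.singletonList(Mutation.newBuilder()
                .setSetCell(Mutation.SetCell.newBuilder()
                    .setFamilyName("aggregates")            // placeholder column family
                    .setColumnQualifier(ByteString.copyFromUtf8("max_price"))
                    .setValue(ByteString.copyFromUtf8(Double.toString(kv.getValue())))
                    .setTimestampMicros(-1))                // -1 = server-assigned time
                .build()))))
    .apply("WriteToBigtable", BigtableIO.write()
        .withProjectId("my-project")      // placeholder
        .withInstanceId("my-instance")    // placeholder
        .withTableId("client-aggregates") // placeholder
    );
```

And for the monitoring part, a sketch of such an automation script using the Dataflow and Cloud Bigtable Java clients, assuming application default credentials; all ids are placeholders:

```java
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.DataflowScopes;
import com.google.api.services.dataflow.model.Job;
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class JobStateRecorder {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project"; // placeholder
    String jobId = "my-job-id";      // placeholder

    // Call projects.jobs.get; the returned Job carries the "currentState" field.
    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            GoogleCredential.getApplicationDefault()
                .createScoped(DataflowScopes.all()))
        .setApplicationName("job-state-recorder")
        .build();
    Job job = dataflow.projects().jobs().get(projectId, jobId).execute();
    String state = job.getCurrentState(); // e.g. "JOB_STATE_RUNNING"

    // Store the state in Bigtable, one row per job.
    try (BigtableDataClient bigtable =
            BigtableDataClient.create(projectId, "my-instance" /* placeholder */)) {
      bigtable.mutateRow(RowMutation.create("job-states", jobId)
          .setCell("state", "current_state", state));
    }
  }
}
```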