I have a specific use case for writing my pipeline data. I want to create a single Pub/Sub subscription, read from that single source, and write the resulting PCollection to multiple sinks without creating another subscription for it. In other words, I want one Dataflow job with multiple branches running in parallel that write the same pipeline data, first to Google Cloud Storage and second to BigQuery, using just that one subscription. Code or references would be helpful and would shed light on the direction I should take.
Thanks in advance!!
You only need to use multiple sinks in your Beam job to meet this need. In Beam you can build a PCollection and then write that same PCollection to several places.
Example with Beam Python:
import apache_beam as beam
from apache_beam.io import ReadFromPubSub, fileio
from apache_beam.transforms.window import FixedWindows

# Read once from the Pub/Sub subscription and apply the shared transforms.
result_pcollection = (
    inputs
    | 'Read from pub sub' >> ReadFromPubSub(subscription=subscription_path)
    | 'Map 1' >> beam.Map(your_map1)
    | 'Map 2' >> beam.Map(your_map2)
)

# Sink 1: BigQuery
(
    result_pcollection
    | 'Map 3' >> beam.Map(apply_transform_logic_bq)
    | 'Write to BQ' >> beam.io.WriteToBigQuery(
        project=project_id,
        dataset=dataset,
        table=table,
        method='YOUR_WRITE_METHOD',  # e.g. 'STREAMING_INSERTS' for a streaming pipeline
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

# Sink 2: GCS (windowed, one set of files per window)
(
    result_pcollection
    | 'Map 4' >> beam.Map(apply_transform_logic_gcs)
    | 'Windowing logic' >> beam.WindowInto(FixedWindows(10 * 60))
    | 'Write to GCS' >> fileio.WriteToFiles(path=known_args.output)
)
To write a streaming flow to GCS, you need to apply windowing and generate a file per window.
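For reference, here is a minimal end-to-end sketch of the same fan-out inside one streaming pipeline. The project ID, subscription path, dataset/table name and bucket path are placeholders, and the BigQuery branch assumes the destination table already exists with a single STRING column named raw:

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Placeholder identifiers -- replace with your own values.
PROJECT_ID = 'my-project'
SUBSCRIPTION = 'projects/my-project/subscriptions/my-subscription'
OUTPUT_PATH = 'gs://my-bucket/output/'


def run():
    options = PipelineOptions(streaming=True, project=PROJECT_ID)

    with beam.Pipeline(options=options) as p:
        # Read once from the single Pub/Sub subscription.
        messages = (
            p
            | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
        )

        # Branch 1: write each element to BigQuery (table assumed to exist).
        (
            messages
            | 'To BQ row' >> beam.Map(lambda line: {'raw': line})
            | 'Write to BQ' >> beam.io.WriteToBigQuery(
                table=f'{PROJECT_ID}:my_dataset.my_table',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )

        # Branch 2: window the same PCollection and write files per window to GCS.
        (
            messages
            | 'Window' >> beam.WindowInto(FixedWindows(10 * 60))
            | 'Write to GCS' >> fileio.WriteToFiles(path=OUTPUT_PATH)
        )


if __name__ == '__main__':
    run()

Both branches start from the same messages PCollection, so the subscription is read only once and each element fans out to both sinks.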