I have a specific use case for writing my pipeline data. I want to create a single Pub/Sub subscription, read from that single source, and write the resulting PCollection to multiple sinks without creating another subscription for it. In other words, I want one Dataflow job with multiple branches running in parallel that write the same pipeline data, first to Google Cloud Storage and second to BigQuery, using just that one subscription. Code or references would be helpful and would shed light on the direction I should take.
Thanks in advance!!
You only need to use multiple sinks in your Beam job to meet this need. In Beam you can build a PCollection and then write that same PCollection to several places.
Example with Beam Python:
import apache_beam as beam
from apache_beam.io import ReadFromPubSub, fileio
from apache_beam.transforms.window import FixedWindows

# Read once from the Pub/Sub subscription and apply the shared transforms.
result_pcollection = (
    inputs
    | 'Read from pub sub' >> ReadFromPubSub(subscription=subscription_path)
    | 'Map 1' >> beam.Map(your_map1)
    | 'Map 2' >> beam.Map(your_map2)
)

# Sink 1: BigQuery
(
    result_pcollection
    | 'Map 3' >> beam.Map(apply_transform_logic_bq)
    | 'Write to BQ' >> beam.io.WriteToBigQuery(
        project=project_id,
        dataset=dataset,
        table=table,
        method='YOUR_WRITE_METHOD',  # e.g. 'STREAMING_INSERTS' for a streaming pipeline
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
)

# Sink 2: GCS (windowed, one set of files per window)
(
    result_pcollection
    | 'Map 4' >> beam.Map(apply_transform_logic_gcs)
    | 'Windowing logic' >> beam.WindowInto(FixedWindows(10 * 60))
    | 'Write to GCS' >> fileio.WriteToFiles(path=known_args.output)
)
To write a streaming flow to GCS, you need to apply windowing and generate a file per window.
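For reference, here is a minimal end-to-end sketch of the same fan-out inside one streaming pipeline. The project ID, subscription path, dataset/table name and bucket path are placeholders, and the BigQuery branch assumes the destination table already exists with a single STRING column named raw:

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Placeholder identifiers -- replace with your own values.
PROJECT_ID = 'my-project'
SUBSCRIPTION = 'projects/my-project/subscriptions/my-subscription'
OUTPUT_PATH = 'gs://my-bucket/output/'


def run():
    options = PipelineOptions(streaming=True, project=PROJECT_ID)

    with beam.Pipeline(options=options) as p:
        # Read once from the single Pub/Sub subscription.
        messages = (
            p
            | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
        )

        # Branch 1: write each element to BigQuery (table assumed to exist).
        (
            messages
            | 'To BQ row' >> beam.Map(lambda line: {'raw': line})
            | 'Write to BQ' >> beam.io.WriteToBigQuery(
                table=f'{PROJECT_ID}:my_dataset.my_table',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
        )

        # Branch 2: window the same PCollection and write files per window to GCS.
        (
            messages
            | 'Window' >> beam.WindowInto(FixedWindows(10 * 60))
            | 'Write to GCS' >> fileio.WriteToFiles(path=OUTPUT_PATH)
        )


if __name__ == '__main__':
    run()

Both branches start from the same messages PCollection, so the subscription is read only once and each element fans out to both sinks.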