Tags: google-bigquery, google-cloud-dataflow, apache-beam

How can I convert Apache Beam Rows to BigQuery TableRows in a Java Dataflow pipeline?


I need to write data into a BigQuery table within a Dataflow pipeline using Java. As such, I need to get a PCollection of TableRow:

TableSchema tableSchema = // omitted
BigQueryIO.Write<TableRow> bigQuerySink = BigQueryIO.<TableRow>write()
    .to(new TableReference()
        .setProjectId(options.getProject())
        .setDatasetId(options.getBigQueryDataset())
        .setTableId(options.getBigQueryTable()))
    .withSchema(tableSchema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);

As of now, in my pipeline, I have a PCollection of generic Row objects from Apache Beam (org.apache.beam.sdk.values.Row). I also have the corresponding Beam Schema.

So I need to implement a DoFn<Row, TableRow>, and I could probably come up with something, but correctly supporting multi-valued types is not easy (e.g. mapping an ARRAY(INT64) Beam field to an INT64 BigQuery column with mode REPEATED, and so on).
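For illustration, here is the start of a hand-rolled mapping (a naive sketch that handles only flat primitive schemas; the class and method names are mine, and the hard cases it skips are exactly the ones mentioned above):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

class RowToTableRowFn extends DoFn<Row, TableRow> {

  @ProcessElement
  public void processElement(@Element Row row, OutputReceiver<TableRow> out) {
    out.output(toTableRowNaive(row));
  }

  // Naive field-by-field copy: works for flat primitive schemas only.
  // Arrays (mode REPEATED), nested rows (STRUCT) and logical types would
  // all need dedicated branches here -- the wheel I did not want to reinvent.
  static TableRow toTableRowNaive(Row row) {
    TableRow tableRow = new TableRow();
    for (Schema.Field field : row.getSchema().getFields()) {
      tableRow.set(field.getName(), row.getValue(field.getName()));
    }
    return tableRow;
  }
}
```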

My feeling is that I am reinventing the wheel. Does this class already exist? Do I even need such a mapping?

I looked for this mapper in the Dataflow templates and the Google Cloud libraries without finding it.


Solution

  • Answering my own question (it seems that writing up your issue makes you twice as sharp):

    I found the mapper I was looking for in the BigQueryUtils class, here: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java#L508-L516

    The required dependency is:

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
    </dependency>
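    With that dependency on the classpath, the conversion collapses to a one-liner via BigQueryUtils.toTableRow. A minimal sketch, assuming `rows` is the PCollection<Row> from the question and `bigQuerySink` is the BigQueryIO sink defined above (BigQueryUtils.toTableSchema can likewise derive the TableSchema from the Beam Schema):

    ```java
    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Convert each Beam Row to a BigQuery TableRow with the built-in mapper,
    // then hand the result to the BigQueryIO sink from the question.
    PCollection<TableRow> tableRows =
        rows.apply("RowToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via(BigQueryUtils::toTableRow));

    tableRows.apply("WriteToBigQuery", bigQuerySink);
    ```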