I need to write data into BigQuery within a Dataflow pipeline using Java. As such, I need to get a PCollection of TableRow:
TableSchema tableSchema = // omitted
BigQueryIO.Write<TableRow> bigQuerySink = BigQueryIO.<TableRow>write()
.to(new TableReference()
.setProjectId(options.getProject())
.setDatasetId(options.getBigQueryDataset())
.setTableId(options.getBigQueryTable()))
.withSchema(tableSchema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);
As of now, in my pipeline, I have a PCollection of generic Row objects from Apache Beam (org.apache.beam.sdk.values.Row). I also have the Beam schema (Schema).
So I need to implement a DoFn<Row, TableRow>, and I will probably come up with something, but it is not easy to support multi-valued types, e.g. mapping ARRAY(INT64) to an INT64 field with mode REPEATED, etc.
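For reference, a hand-rolled version of that DoFn might start out like the sketch below. This is my own naive attempt, not an existing class: it copies values through as-is and does not handle logical types, nested rows, or timestamp formatting.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

// Hand-rolled sketch of the Row -> TableRow mapping; a naive value copy
// that does NOT cover every Beam type (logical types, nested rows, ...)
public class RowToTableRowFn extends DoFn<Row, TableRow> {

  // Kept as a static method so the conversion is testable without a pipeline
  static TableRow convert(Row row) {
    TableRow tableRow = new TableRow();
    for (Schema.Field field : row.getSchema().getFields()) {
      // An ARRAY(INT64) Beam field arrives here as a List<Long>, which is
      // the shape a REPEATED INT64 BigQuery field expects
      tableRow.set(field.getName(), row.getValue(field.getName()));
    }
    return tableRow;
  }

  @ProcessElement
  public void processElement(@Element Row row, OutputReceiver<TableRow> out) {
    out.output(convert(row));
  }
}
```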
My feeling is that I am reinventing the wheel. Does this class already exist? Do I need such a mapping at all?
I looked for this mapper in the Dataflow templates and the Google Cloud libraries without finding it.
Answering my own question (it seems that writing out your issue makes you twice as sharp):
I found the mapper I was looking for in the BigQueryUtils class, here:
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java#L508-L516
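A minimal sketch of using it (the schema, field names, and values below are made up for the demo):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class BigQueryUtilsDemo {
  public static void main(String[] args) {
    // Hypothetical Beam schema and row, just to exercise the converter
    Schema schema = Schema.builder()
        .addInt64Field("id")
        .addStringField("name")
        .build();
    Row row = Row.withSchema(schema).addValues(42L, "alice").build();

    // BigQueryUtils.toTableRow(Row) performs the Row -> TableRow mapping,
    // including ARRAY fields (mapped to REPEATED) and nested rows
    TableRow tableRow = BigQueryUtils.toTableRow(row);
    System.out.println(tableRow);
  }
}
```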
The required dependency is:
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
</dependency>
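The same class also converts a Beam Schema to a BigQuery TableSchema, so the pieces from the question can be wired up roughly as below. This is a sketch; the method and variable names around the helpers are my own:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

public class RowToTableRowWiring {

  // BigQueryUtils.toTableRow() (no arguments) returns a
  // SerializableFunction<Row, TableRow>, ready to plug into MapElements
  public static PCollection<TableRow> toTableRows(PCollection<Row> rows) {
    return rows.apply(
        MapElements.into(TypeDescriptor.of(TableRow.class))
            .via(BigQueryUtils.toTableRow()));
  }

  // Derives the TableSchema for withSchema(...) from the Beam schema,
  // including REPEATED mode for array fields
  public static TableSchema toTableSchema(Schema beamSchema) {
    return BigQueryUtils.toTableSchema(beamSchema);
  }
}
```

With this, the PCollection<TableRow> from toTableRows can be applied directly to the BigQueryIO sink from the question, and withSchema can take toTableSchema(beamSchema) instead of a hand-written TableSchema.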