apache-spark, google-bigquery, google-cloud-dataproc, spark-bigquery-connector

BigQuery as metastore for Dataproc


We are trying to migrate a PySpark script from on-premise to the GCP platform. The script creates and drops tables in Hive and performs data transformations.

Hive is replaced by BigQuery, so the Hive reads and writes are converted to BigQuery reads and writes using the spark-bigquery-connector.
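
For reference, a minimal sketch of what the converted reads and writes look like with the spark-bigquery-connector; the project, dataset, table, and bucket names are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes the spark-bigquery-connector jar is available on the cluster
# (it ships with recent Dataproc image versions).
spark = SparkSession.builder.appName("bq-read-write").getOrCreate()

# Read: replaces the former Hive table read.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.source_table")  # placeholder table
    .load()
)

transformed = df.filter("event_date >= '2023-01-01'")  # example transformation

# Write: replaces the former Hive table write. The indirect write method stages
# the data in a GCS bucket before loading it into BigQuery.
(
    transformed.write.format("bigquery")
    .option("temporaryGcsBucket", "my-staging-bucket")  # placeholder bucket
    .mode("overwrite")
    .save("my-project.my_dataset.target_table")
)
```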

However, the problem lies with creating and dropping BigQuery tables via Spark SQL: by default, Spark SQL runs CREATE and DROP statements against Hive (backed by the Hive metastore), not against BigQuery.
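
To illustrate the gap, and one workaround we are considering (not part of the connector): Spark SQL DDL only touches the Hive metastore, while BigQuery DDL can be issued from the driver with the google-cloud-bigquery client library. A rough sketch with placeholder names:

```python
from google.cloud import bigquery
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# This DDL goes to the Hive metastore backing Spark SQL, not to BigQuery;
# no BigQuery table is created or dropped by it.
spark.sql("CREATE TABLE IF NOT EXISTS hive_only_table (id BIGINT)")

# Workaround: run BigQuery DDL via the client library on the driver, and keep
# using the connector for the actual reads and writes.
client = bigquery.Client(project="my-project")  # placeholder project
client.query(
    "CREATE TABLE IF NOT EXISTS my_dataset.target_table (id INT64, name STRING)"
).result()
client.query("DROP TABLE IF EXISTS my_dataset.staging_table").result()
```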

I wanted to check whether there is a plan to add DDL statement support to the spark-bigquery-connector.

Also, from an architecture perspective, is it possible to back the Spark SQL metastore with BigQuery so that any CREATE or DROP statement can be run on BigQuery from Spark?


Solution

  • I don't think Spark SQL will support BigQuery as a metastore, nor will the BigQuery connector support BigQuery DDL. On Dataproc, Dataproc Metastore (DPMS) is the recommended solution for the Hive and Spark SQL metastore.

    In particular, for an on-prem to Dataproc migration, it is more straightforward to migrate to DPMS; see this doc and the sketch below.
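
To make the DPMS route concrete: once the cluster is created with a Dataproc Metastore service attached, Spark SQL DDL is resolved against that metastore (with table data typically stored on GCS), not against BigQuery. A rough sketch, with placeholder table and bucket names:

```python
from pyspark.sql import SparkSession

# Assumes the Dataproc cluster was created with a Dataproc Metastore (DPMS)
# service attached, so Hive and Spark SQL share it as their metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# CREATE/DROP are recorded in DPMS; the table data lives on GCS, not in BigQuery.
spark.sql("""
    CREATE TABLE IF NOT EXISTS daily_summary (id BIGINT, total DOUBLE)
    STORED AS PARQUET
    LOCATION 'gs://my-bucket/warehouse/daily_summary'
""")
spark.sql("DROP TABLE IF EXISTS stale_staging_table")
```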