apache-spark, google-cloud-platform, google-bigquery, google-cloud-storage, delta-lake

How to load Delta Lake files from GCS into BigQuery


Is there a library/connector available to import Delta Lake files stored in Google Cloud Storage (GCS) directly to BigQuery?

I have managed to write BigQuery tables using a Spark DataFrame as an intermediary, but I can't find any direct connector or BigQuery library that does this without going through Spark DataFrames.

I tried using the official spark-bigquery-connector, but its documentation is unclear on how to point to a specific BigQuery project, so I couldn't get further than loading the Delta Lake files from GCS into a DataFrame.

Following Javier's comment, I managed to write to BigQuery, but this solution isn't optimal: however much I tune the Spark job, it will never be as direct as a Google BigQuery library that handles the ingestion under the hood.

Not finding any direct solution, I ended up using the spark-bigquery-connector to ingest the Delta files as follows:

import io.delta.tables.DeltaTable

// Load the Delta table from its GCS path (placeholder path)
val myDeltaTable = DeltaTable.forPath(spark, "gs://MY_BUCKET/MY_DELTA_TABLE")

myDeltaTable.toDF
  .write
  .format("bigquery")
  .mode("overwrite")
  .option("parentProject", "MY_PARENT_PROJECT")            // project billed for the job
  .option("project", "MY_PROJECT")                         // project owning the target dataset
  .option("temporaryGcsBucket", "A_STAGING_BUCKET_IN_GCS") // mandatory staging bucket for the load
  .option("partitionField", "MY_PARTITION_FIELD")          // partition the BigQuery table on this field
  .save("DATASET.TABLE")

Solution

  • There is no way to ingest a Delta Lake file in GCS into BigQuery without going through some intermediary.

    You could set up a GCE VM that downloads the Delta Lake files from GCS, reads them using the Delta Standalone connector, and then writes to BigQuery (either via the streaming API or by loading a supported format such as Parquet); a rough sketch of this approach follows below.

    However, this is essentially doing by hand the same work Spark would do for you.
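
    A minimal sketch of that VM-based approach, assuming a placeholder table path gs://MY_BUCKET/MY_DELTA_TABLE, placeholder project/dataset/table names, and that the delta-standalone and google-cloud-bigquery libraries (plus the Hadoop GCS connector) are on the classpath. Since the Delta data files are already Parquet, they can be sent to BigQuery with an ordinary load job rather than the streaming API:

    import io.delta.standalone.DeltaLog
    import org.apache.hadoop.conf.Configuration
    import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, JobInfo, LoadJobConfiguration, TableId}
    import scala.jdk.CollectionConverters._

    object DeltaToBigQuery {
      def main(args: Array[String]): Unit = {
        val tablePath = "gs://MY_BUCKET/MY_DELTA_TABLE" // placeholder Delta table path

        // Read the Delta transaction log and list the Parquet files of the latest snapshot.
        val deltaLog = DeltaLog.forTable(new Configuration(), tablePath)
        val parquetUris = deltaLog.snapshot().getAllFiles.asScala
          .map(addFile => s"$tablePath/${addFile.getPath}")
          .toList

        // Load those Parquet files into BigQuery with a standard load job.
        val bigquery = BigQueryOptions.newBuilder()
          .setProjectId("MY_PROJECT") // placeholder project
          .build()
          .getService
        val loadConfig = LoadJobConfiguration
          .newBuilder(TableId.of("DATASET", "TABLE"), parquetUris.asJava)
          .setFormatOptions(FormatOptions.parquet())
          .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
          .build()
        bigquery.create(JobInfo.of(loadConfig)).waitFor()
      }
    }

    Listing the snapshot's files through the transaction log is what makes this correct: scanning every Parquet file under the path would also pick up data removed by earlier updates or deletes.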