Tags: apache-spark, google-bigquery, parquet

Load partitioned (Spark) Parquet to a BigQuery table


I have data written out from Spark to Parquet files in GCS, partitioned on a date column. The data in GCS look like this:

gs://mybucket/dataset/fileDate=2019-06-17/000.parquet
gs://mybucket/dataset/fileDate=2019-06-17/001.parquet

I'd like to load this into BigQuery such that the fileDate column in the resulting (partitioned) table is populated from the path. How can I do this?
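For context, a layout like this is what Spark's partitionBy produces. A minimal sketch of the write side, assuming a hypothetical DataFrame df and source path (neither is from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-partitioned").getOrCreate()

df = spark.read.parquet("gs://mybucket/raw/")  # hypothetical source data

# partitionBy moves fileDate out of the Parquet files and into the
# directory names, which is why a plain load later sees the column as null.
df.write.partitionBy("fileDate").parquet("gs://mybucket/dataset/")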

What I've tried so far is to create the table and load the data into it with:

bq --location=US load --source_format=PARQUET 'workspace:marcin_test.dataset_table' 'gs://mybucket/dataset/fileDate=2019-06-17/*'

This works insofar as the data are loaded, but fileDate is null.


Solution

  • These files appear to be Hive-partitioned, judging by the fileDate=... path layout, but correct me if I am wrong. Try this:

    bq load --source_format=PARQUET --autodetect --hive_partitioning_mode=AUTO --hive_partitioning_source_uri_prefix=gs://mybucket/dataset/ project-id:dataset_name.table_name 'gs://mybucket/dataset/fileDate=2019-06-17/*.parquet'

    Reference: https://cloud.google.com/bigquery/docs/hive-partitioned-loads-gcs. It should work.
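    If you'd rather do this from code than the CLI, here is a minimal sketch of the equivalent load using the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders, not values from the question:

    from google.cloud import bigquery

    client = bigquery.Client(project="project-id")  # placeholder project

    # Mirror the CLI flags: AUTO mode infers the partition key name (fileDate)
    # and its type from the path segments under the source URI prefix.
    hive_config = bigquery.HivePartitioningOptions()
    hive_config.mode = "AUTO"
    hive_config.source_uri_prefix = "gs://mybucket/dataset/"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        autodetect=True,
        hive_partitioning=hive_config,
    )

    load_job = client.load_table_from_uri(
        "gs://mybucket/dataset/fileDate=2019-06-17/*.parquet",
        "project-id.dataset_name.table_name",  # placeholder table
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes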