Tags: google-cloud-platform, google-bigquery, partitioning

Unable to query BigQuery external table on partitioned data with spark


I was trying to create an external table over date-partitioned data in GCS, written in PARQUET format by a Spark job.

The data in the GCS bucket is laid out as shown in the picture.

(screenshot: listing of the trx_data/ path in the bucket, showing transaction_date=... directories and a _SUCCESS file)

I created an external table with the following table definition:

{
  "hivePartitioningOptions": {
    "mode": "AUTO",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}

using the command:

bq mk --external_table_definition=/tmp/table_def <project>:<dataset>.sample_trx_external
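As a sanity check (not from the original post), the definition BigQuery actually recorded can be inspected with the standard bq show command; the hivePartitioningOptions block should match the file passed to bq mk. This can't run outside a configured gcloud environment, so it is shown as a sketch:

```shell
# Inspect the external table configuration BigQuery stored;
# look for externalDataConfiguration.hivePartitioningOptions in the output.
bq show --format=prettyjson <project>:<dataset>.sample_trx_external
```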

When I try to query the table, I get a strange error:

Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01.; Cannot add hive partitioning to table <data_set>.sample_trx_external -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01..

Unfortunately I couldn't decipher the message. There is only one day of transactions written to the GCS bucket.

When I tried with a CUSTOM mode definition,

{
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data/{transaction_date:DATE}"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}

I got a slightly different error:

Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data.; Cannot add hive partitioning to table <data_Set>.sample_trx_external_2 -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data..

I'm stuck here, and any suggestion would be a great help.


Solution

  • As shown in the image attached to the question, there is a _SUCCESS file, written by the Spark job that created the partitioned data set. The problem is that under the "gs://transaction_data_bucket_for_bigquery/trx_data/" path, the BigQuery external table expects every directory and file to follow the partitioned (key=value) layout. _SUCCESS clearly doesn't follow this structure, which produced the error messages above.

    Fix: I simply deleted the file, and everything works as expected.
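The mismatch the error reports can be sketched with plain shell string handling (an illustration only, not BigQuery's actual detection code): every object under the prefix must encode the same partition key names via key=value directories, and a bare file like _SUCCESS encodes zero keys while the data files encode one.

```shell
# Illustrative sketch: extract hive partition key names from object paths
# below the sourceUriPrefix, the layout AUTO mode tries to infer.
prefix="gs://transaction_data_bucket_for_bigquery/trx_data/"

partition_keys() {
  # Print the key names (e.g. transaction_date) encoded in the path below $prefix.
  local rel="${1#"$prefix"}"
  local dirs="${rel%/*}"            # drop the trailing file name
  [ "$dirs" = "$rel" ] && return 0  # no directories at all -> zero keys
  printf '%s\n' "$dirs" | tr '/' '\n' | grep -o '^[^=]*=' | tr -d '='
}

partition_keys "${prefix}transaction_date=2016-01-01/part-00000.parquet"
# -> transaction_date

partition_keys "${prefix}_SUCCESS"
# -> (nothing: zero keys, conflicting with the data files -- the
#    "Expected ... partition keys" mismatch from the error message)
```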
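For reference, the one-off cleanup can be done with gsutil, and future Spark jobs can be told not to write the marker at all via the standard Hadoop FileOutputCommitter setting (the spark-submit invocation below is a sketch; the trailing arguments depend on your job):

```shell
# One-off cleanup: remove the marker file the Spark job left behind.
gsutil rm gs://transaction_data_bucket_for_bigquery/trx_data/_SUCCESS

# Optional: stop future Spark jobs from writing _SUCCESS at all
# (Hadoop's mapreduce.fileoutputcommitter.marksuccessfuljobs property,
# passed through Spark's spark.hadoop.* prefix).
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
  your_job.py
```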