Search code examples
google-cloud-platformgoogle-bigqueryparquet

Bigquery export as parquet file partitioning


Whenever I need to export data from Bigquery to parquet, I find myself in the following situation, either I use:

  1. dask-bigquery: which takes around 40 min for my dataset; outputing 700 files of ~12MB; or
  2. Bigquery's EXPORT DATA SQL statement letting BQ to exec the export: takes around 2 min; but generates 10.000 files of ~200KB, which then makes it unusable to work with, without an expensive repartitioning operation.

Is there a way to get the best of both worlds? i.e. using the export statement in BQ while configuring / optimising the partitioning strategy?


Solution

  • You can test the performance of adding a top level sort to one or more columns in the "EXPORT DATA" statement which often has the side effect of producing more reasonably sized files (e.g sorting by clustering/partition columns)