Whenever I need to export data from BigQuery to Parquet, I find myself choosing between two options:

- `dask-bigquery`: takes around 40 min for my dataset and outputs 700 files of ~12 MB each; or
- an `EXPORT DATA` SQL statement, letting BigQuery execute the export: takes around 2 min, but generates 10,000 files of ~200 KB each, which makes the output unusable without an expensive repartitioning operation.

Is there a way to get the best of both worlds, i.e. use the `EXPORT DATA` statement in BigQuery while configuring/optimising the partitioning strategy?
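For reference, the `EXPORT DATA` route looks roughly like this (the bucket, dataset, and table names below are placeholders):

```sql
-- Plain export: BigQuery parallelises the final write stage heavily,
-- which is what produces the large number of small files.
EXPORT DATA
  OPTIONS (
    uri = 'gs://my-bucket/export/*.parquet',  -- placeholder bucket/path
    format = 'PARQUET',
    overwrite = true
  )
AS (
  SELECT *
  FROM `my-project.my_dataset.my_table`  -- placeholder table
);
```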
You can test the performance of adding a top-level `ORDER BY` on one or more columns in the `EXPORT DATA` statement, which often has the side effect of producing more reasonably sized files (e.g. sorting by the clustering/partitioning columns).
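A sketch of what that looks like; the bucket, table, and sort columns are placeholders to adapt to your schema:

```sql
-- Same export with a top-level ORDER BY: the sort constrains the
-- parallelism of the final write stage, which tends to yield fewer,
-- larger files.
EXPORT DATA
  OPTIONS (
    uri = 'gs://my-bucket/export-sorted/*.parquet',  -- placeholder bucket/path
    format = 'PARQUET',
    overwrite = true
  )
AS (
  SELECT *
  FROM `my-project.my_dataset.my_table`   -- placeholder table
  ORDER BY partition_col, clustering_col  -- placeholder sort columns
);
```

The resulting file count and sizes still depend on the table, so it is worth comparing a few runs; sorting by columns the table is already partitioned/clustered on tends to keep the extra sort cost low, since the data is largely pre-ordered.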