I'm saving a DataFrame to BigQuery with:
df_full.write.format('bigquery') \
    .partitionBy('date_time') \
    .mode('overwrite') \
    .option('table', 'dataset.reports-full') \
    .save()
But when I look at the table details, the DataFrame was not saved as partitioned. The Dataproc job output also shows this:
schemaUpdateOptions=null, autodetect=true, timePartitioning=null, clustering=null, useAvroLogicalTypes=null, labels=null, jobTimeoutMs=null, rangePartitioning=null, hivePartitioningOptions=null, referenceFileSchemaUri=null}. jobId: JobId{project=380911, job=33f2e886-3776-4caf-a336-d16ad8a2c2fd, location=europe-west3}
Is there any way to save the DataFrame as a partitioned table in BigQuery with PySpark?
You can use the partitionField, datePartition and partitionType options.
partitionField: If field is specified together with partitionType, the table is partitioned by this field. The field must be a top-level TIMESTAMP or DATE field. Its mode must be NULLABLE or REQUIRED.
-- Docs
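For example, a minimal sketch adapting your snippet (assuming date_time is a top-level TIMESTAMP or DATE column; DAY is shown here, pick the granularity you need):

df_full.write.format('bigquery') \
    .option('table', 'dataset.reports-full') \
    .option('partitionField', 'date_time') \
    .option('partitionType', 'DAY') \
    .mode('overwrite') \
    .save()

Note that partitioning is configured through connector options rather than partitionBy, which is why your current write produces timePartitioning=null in the job output.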
For clustering, use clusteredFields.
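For example (the clustering columns col1,col2 are placeholders; use your own non-repeated, top-level columns):

df_full.write.format('bigquery') \
    .option('table', 'dataset.reports-full') \
    .option('partitionField', 'date_time') \
    .option('clusteredFields', 'col1,col2') \
    .mode('overwrite') \
    .save()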
See partitioning:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector#configuring-partitioning
See more options:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties