apache-spark · google-cloud-platform · pyspark · google-bigquery

PySpark write to BigQuery with partitionBy not saving as partitioned


I'm saving a dataframe to BigQuery with:

    df_full.write.format('bigquery') \
        .partitionBy('date_time') \
        .mode('overwrite') \
        .option('table', 'dataset.reports-full') \
        .save()

But when I look at the table details, the table is not partitioned. The Dataproc job output also shows this:

schemaUpdateOptions=null, autodetect=true, timePartitioning=null, clustering=null, useAvroLogicalTypes=null, labels=null, jobTimeoutMs=null, rangePartitioning=null, hivePartitioningOptions=null, referenceFileSchemaUri=null}. jobId: JobId{project=380911, job=33f2e886-3776-4caf-a336-d16ad8a2c2fd, location=europe-west3}

Is there any way to save the dataframe to BigQuery as a partitioned table with PySpark?


Solution

  • You can use the connector options partitionField, datePartition, and partitionType instead of the Spark-level partitionBy() call (a sketch is shown after the links below).

    partitionField:

    If this field is specified together with partitionType, the table is partitioned by this field. The field must be a top-level TIMESTAMP or DATE field. Its mode must be NULLABLE or REQUIRED.

    -- Docs

    For clustering, use clusteredFields.

    See partitioning:

    https://github.com/GoogleCloudDataproc/spark-bigquery-connector#configuring-partitioning

    See more options:

    https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties
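
    Putting it together, a minimal sketch adapted from the write in the question: it passes the partitioning settings as connector options rather than through partitionBy(). It assumes date_time is a top-level DATE or TIMESTAMP column and that a temporary GCS bucket is already configured for the connector; the clusteredFields value (report_id) is a hypothetical column name and can be dropped if clustering is not needed.

        df_full.write.format('bigquery') \
            .option('table', 'dataset.reports-full') \
            .option('partitionField', 'date_time') \
            .option('partitionType', 'DAY') \
            .option('clusteredFields', 'report_id') \
            .mode('overwrite') \
            .save()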