apache-spark · google-cloud-platform · pyspark · google-bigquery

PySpark write to BigQuery with partitionBy not saving as partitioned


I'm saving a dataframe to BigQuery with:

    df_full.write.format('bigquery') \
        .partitionBy('date_time') \
        .mode('overwrite') \
        .option('table', 'dataset.reports-full') \
        .save()

But when I look at the table details, the table is not partitioned. The Dataproc job output also shows this:

schemaUpdateOptions=null, autodetect=true, timePartitioning=null, clustering=null, useAvroLogicalTypes=null, labels=null, jobTimeoutMs=null, rangePartitioning=null, hivePartitioningOptions=null, referenceFileSchemaUri=null}. jobId: JobId{project=380911, job=33f2e886-3776-4caf-a336-d16ad8a2c2fd, location=europe-west3}

Is there any way to save the dataframe to BigQuery as a partitioned table with PySpark?


Solution

  • You can use the connector options partitionField, datePartition, and partitionType instead of the Spark-level partitionBy() call (a sketch is shown after the links below).

    partitionField:

    If this field is specified together with partitionType, the table is partitioned by this field. The field must be a top-level TIMESTAMP or DATE field. Its mode must be NULLABLE or REQUIRED.

    -- Docs

    For clustering, use clusteredFields.

    See partitioning:

    https://github.com/GoogleCloudDataproc/spark-bigquery-connector#configuring-partitioning

    See more options:

    https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties
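
    Putting it together, a minimal sketch adapted from the write in the question: it passes the partitioning settings as connector options rather than through partitionBy(). It assumes date_time is a top-level DATE or TIMESTAMP column and that a temporary GCS bucket is already configured for the connector; the clusteredFields value (report_id) is a hypothetical column name and can be dropped if clustering is not needed.

        df_full.write.format('bigquery') \
            .option('table', 'dataset.reports-full') \
            .option('partitionField', 'date_time') \
            .option('partitionType', 'DAY') \
            .option('clusteredFields', 'report_id') \
            .mode('overwrite') \
            .save()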