
How to write a dataframe to BigQuery and overwrite partition instead of the table?


I need to write a polars dataframe into a BigQuery table. The table is partitioned by date.

When I need to run a backfilling script, I iterate over a date range, get the data from some source (an API in this case), convert it into a dataframe, manipulate it a bit, and write it into the BQ table.

But instead of overwriting only the partition for that date, it overwrites the whole table.

How can I only overwrite the partition?

My code so far:

import io

import polars as pl
from google.cloud import bigquery

# create period_range from internal util_package

for date in period_range:
    data = "get some API data here per date"

    df = pl.read_csv(data).select(pl.col(pl.Int64))

    client = bigquery.Client()
    with io.BytesIO() as stream:
        df.write_parquet(stream)
        stream.seek(0)
        job = client.load_table_from_file(
            file_obj=stream,
            destination="analytics.ads.vendor_name",
            project="mycompany_ads",
            location="EU",
            job_config=bigquery.LoadJobConfig(
                 source_format=bigquery.SourceFormat.PARQUET,
                 time_partitioning=bigquery.TimePartitioning(
                     type_=bigquery.TimePartitioningType.DAY,
                     field="date",  # name of column to use for partitioning
                     require_partition_filter=True,
                 ),
                 clustering_fields=["domain", "type", "placement"],
                 autodetect=True,
                 schema=None,
                 write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            ),
        )
    job.result()  # Waits for the job to complete
    print("ETL finished")

Solution

  • I believe these two questions should solve your issue:

    Disclaimer: I have not tested the solution mentioned

    Documentation on referencing a specific partition with the `$` decorator is here
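    In short: with `write_disposition=WRITE_TRUNCATE`, the load job truncates whatever the `destination` refers to. If the destination is the bare table id, the whole table is replaced; if it is the table id plus a `$YYYYMMDD` partition decorator, only that day's partition is replaced. A minimal sketch of building the decorated destination (the table name is taken from the question; the helper name is my own):

    ```python
    from datetime import date


    def partition_decorator(table: str, day: date) -> str:
        """Append BigQuery's $YYYYMMDD partition decorator to a table id,
        so a WRITE_TRUNCATE load replaces only that day's partition."""
        return f"{table}${day:%Y%m%d}"


    # In the backfill loop, target the decorated table instead of the bare one:
    #   job = client.load_table_from_file(
    #       file_obj=stream,
    #       destination=partition_decorator("analytics.ads.vendor_name", current_date),
    #       ...  # rest of the job config as in the question
    #   )
    ```

    For example, `partition_decorator("analytics.ads.vendor_name", date(2023, 5, 1))` yields `"analytics.ads.vendor_name$20230501"`, and a `WRITE_TRUNCATE` load into that destination leaves every other partition untouched.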