I need to write a Polars dataframe into a BigQuery table. The table is partitioned by date.
When I run a backfill script, I iterate over a date range, fetch the data from some source (an API in this case), convert it into a dataframe, manipulate it a bit, and write it into the BQ table.
But instead of overwriting only the partition for that date, it overwrites the whole table.
How can I overwrite just that one partition?
My code so far:
import io

import polars as pl
from google.cloud import bigquery

client = bigquery.Client()

# create period_range from internal util_package
for date in period_range:
    data = "get some API data here per date"
    df = pl.read_csv(data).select(pl.col(pl.Int64))
    with io.BytesIO() as stream:
        df.write_parquet(stream)
        stream.seek(0)
        job = client.load_table_from_file(
            file_obj=stream,
            destination="analytics.ads.vendor_name",
            project="mycompany_ads",
            location="EU",
            job_config=bigquery.LoadJobConfig(
                source_format=bigquery.SourceFormat.PARQUET,
                time_partitioning=bigquery.TimePartitioning(
                    type_=bigquery.TimePartitioningType.DAY,
                    field="date",  # name of column to use for partitioning
                    require_partition_filter=True,
                ),
                clustering_fields=["domain", "type", "placement"],
                autodetect=True,
                schema=None,
                write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            ),
        )
        job.result()  # Waits for the job to complete

print("ETL finished")
I believe these two questions should solve your issue.

Disclaimer: I have not tested the solution mentioned.

The key detail is the partition decorator: WRITE_TRUNCATE applied to the bare table name truncates the entire table, but pointing the load job's destination at a specific partition with a $ suffix (table$YYYYMMDD) truncates only that partition. See the BigQuery documentation on referencing a partition with $.
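Here is a minimal, untested sketch of that approach, adapted from your code. The table name and project are copied from your snippet; period_range and the dataframe are mocked stand-ins for your internal helper and API data, and I assume the partitioned table already exists with a compatible schema:

import io
from datetime import date, timedelta

import polars as pl
from google.cloud import bigquery

client = bigquery.Client()

# Stand-in for period_range from your internal util_package.
period_range = [date(2024, 1, 1) + timedelta(days=i) for i in range(3)]

for day in period_range:
    # Stand-in for the real per-date API data.
    df = pl.DataFrame({"date": [day], "clicks": [0]})

    # Partition decorator: "<table>$YYYYMMDD" addresses a single DAY
    # partition, so WRITE_TRUNCATE replaces only that day's data.
    destination = f"analytics.ads.vendor_name${day.strftime('%Y%m%d')}"

    with io.BytesIO() as stream:
        df.write_parquet(stream)
        stream.seek(0)
        job = client.load_table_from_file(
            file_obj=stream,
            destination=destination,
            project="mycompany_ads",
            location="EU",
            job_config=bigquery.LoadJobConfig(
                source_format=bigquery.SourceFormat.PARQUET,
                write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            ),
        )
        job.result()

Note that the time_partitioning and autodetect settings mainly matter when the load job creates the table; when writing into a partition of an existing table they can likely be dropped, as long as whatever you keep stays consistent with the table's configuration.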