Search code examples
pythongoogle-bigquerygoogle-cloud-storagecloudstorage

Google Cloud BigQuery WRITE_APPEND duplicate issue


I keep multiple data with similar name in Google Cloud Storage. My data comes here daily via API and I want to add them to my table in BigQuery, which is refreshed daily, and I do this via Python. When I do WRITE_TRUNCATE, it deletes the table and creates a new table with all the files in the storage, but I need to protect my table because I have historical data in it and they are not in the storage. When I WRITE_APPEND, I have a duplicate problem because it adds all the files in the storage. Anyone here have a suggested solution?

here is a piece of my code:

    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    job_config.skip_leading_rows = 1
    # The source format defaults to CSV, so the line below is optional.
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.autodetect = True
    job_config.max_bad_records = 5
    uri = "gs://" + gcsbucket + "/" + tableprefix ```

Solution

  • @AhmetBuğraBUĞA, As you have mentioned in the comment, Date names were written at the end of the data and the problem can be solved by adding dates in the BigQuery as below which takes a single day from the dates.

    (dt.datetime.today() - dt.timedelta(day=1)).strftime("%Y-%m-%d")