Tags: firebase, google-cloud-firestore, google-bigquery, firebase-extensions

Delete old rows in change log for Firestore to BigQuery Extension


I am using the Firebase extension Stream Firestore to BigQuery to export my data to BigQuery. It has been working great, and I have been using it for a couple of years now. However, the tables are starting to get larger, which is increasing the cost of my queries. I have the following tables, created by the extension and by npx @firebaseextensions/fs-bq-schema-views:

  • mytable_raw_changelog (1,327,594 rows)
  • mytable_raw_latest (55,329 rows)
  • mytable_schema_changelog (1,332,454 rows)
  • mytable_schema_latest (55,745 rows, 3.2GB query)

A single SELECT from the mytable_schema_latest table now processes 3.2GB.
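
If it helps to see where the storage is going first, BigQuery's per-dataset __TABLES__ metadata view reports row counts and stored bytes without scanning any table data. A quick sketch (your_dataset is a placeholder, and note this shows storage size, not the bytes a particular query processes):

-- List row counts and stored size for every table in the dataset.
-- Metadata-only: this query does not scan the tables themselves.
SELECT
  table_id,
  row_count,
  ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
FROM your_dataset.__TABLES__
ORDER BY size_bytes DESC;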

The options I can think of are:

  1. Remove old rows from the _changelog that are no longer needed
  2. Add time partitioning using the Firebase BQ extension config
  3. Create a job to copy mytable_schema_latest to another table for querying, without the historic changes (see the sketch after this list)
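
For reference, option 3 could be as small as a scheduled CREATE TABLE AS SELECT. A minimal sketch, where mytable_latest_snapshot is a hypothetical table name:

-- Materialize the latest-state view into a plain table so routine
-- queries do not scan the changelog history behind the view.
CREATE OR REPLACE TABLE your_dataset.mytable_latest_snapshot AS
SELECT *
FROM your_dataset.mytable_schema_latest;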

I am looking to go with option 1 (removing old rows) for now, as it seems the simplest. Is there a recommended way, or any guides on how to do this?

I only want to remove old historic changes that occurred before a certain date, and only where the document has since been updated.

It may also be a good idea to copy these rows to a historic table before deleting them, in case I need them in the future.
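
One simple way to do that copy is to archive everything older than a cutoff before deleting. A sketch, where mytable_changelog_archive is a hypothetical table name (this over-captures slightly, since it also archives rows that are still the latest version of a document, but that is harmless for a backup):

-- Create the archive table with the same schema, then copy old rows in.
CREATE TABLE IF NOT EXISTS your_dataset.mytable_changelog_archive
LIKE your_dataset.mytable_raw_changelog;

INSERT INTO your_dataset.mytable_changelog_archive
SELECT *
FROM your_dataset.mytable_raw_changelog
WHERE timestamp < TIMESTAMP('2024-01-01');  -- example cutoff date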

Update 2024-03-28: I noticed there is a new parameter in the latest version of the extension. There is not much info about it, but maybe this will do the trick?

Exclude old data payloads: If enabled, table rows will never contain old data (document snapshot before the update), which should be more performant, and avoid potential resource limitations.

Update 2024-05-02: I used Greg's suggestion below and it worked well, although due to the possibility of duplicate document_id values I needed to use document_name, which is the full document path and is actually unique.

-- Keep only each document's most recent change; delete all older rows.
DELETE FROM your_dataset.your_table_changelog
WHERE STRUCT(document_name, timestamp) NOT IN (
    SELECT AS STRUCT document_name, MAX(timestamp) AS timestamp
    FROM your_dataset.your_table_changelog
    GROUP BY document_name
);
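
Before running the DELETE, the same condition can be used in a SELECT to preview how many rows it would remove:

-- Preview: count the rows the DELETE above would remove.
SELECT COUNT(*) AS rows_to_delete
FROM your_dataset.your_table_changelog
WHERE STRUCT(document_name, timestamp) NOT IN (
    SELECT AS STRUCT document_name, MAX(timestamp) AS timestamp
    FROM your_dataset.your_table_changelog
    GROUP BY document_name
);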

Update 2024-05-06: It turns out I needed to add the timestamp condition at the bottom, as I was getting the error: "UPDATE or DELETE statement over table ... would affect rows in the streaming buffer, which is not supported".

-- Same as above, but skip rows from the last 45 minutes, which may
-- still be in the streaming buffer and cannot be deleted.
DELETE FROM your_dataset.your_table_changelog
WHERE STRUCT(document_name, timestamp) NOT IN (
    SELECT AS STRUCT document_name, MAX(timestamp) AS timestamp
    FROM your_dataset.your_table_changelog
    GROUP BY document_name
)
AND timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 45 MINUTE);

See: https://stackoverflow.com/a/53495209/10222449


Solution

  • To create a "backup" of a given table, you can simply run a query and then use SAVE RESULTS >> BigQuery Table to save the output to a new table. Now you have a "backup", and if something goes wrong you can restore from it. (A SQL alternative is sketched after the query below.)

    I think the following SQL statement might delete "old" rows for you:

    -- Keep only the latest row per document; delete the rest.
    -- (The asker's update above swaps document_id for document_name,
    -- since document_id is not guaranteed to be unique.)
    DELETE FROM your_dataset.your_table
    WHERE STRUCT(document_id, timestamp) NOT IN (
        SELECT AS STRUCT document_id, MAX(timestamp) AS timestamp
        FROM your_dataset.your_table
        GROUP BY document_id
    );
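
    If you would rather script the backup than click through the console, a CREATE TABLE ... COPY statement does the same thing in SQL. A sketch, where your_table_backup is a hypothetical name:

    -- Back up the table in SQL before running the DELETE.
    CREATE TABLE your_dataset.your_table_backup
    COPY your_dataset.your_table;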