Search code examples
firebasegoogle-cloud-platformgoogle-cloud-firestoregoogle-bigqueryfirebase-extensions

BigQuery Firestore extension. Not to remove data from BigQuery


I have the Firestore to BigQuery extension installed and working in my environment. I have it streaming all the documents in a collection to BigQuery. It works well, when I create a doc, append a doc or delete a doc in Firestore, the extension dually mirrors the change. Whilst keeping the extension's create and append functions, I want to remove the delete functionality. I'm using Firestore to record the user's actions - each doc contains all the details from the user's session. This data is written and appended into BigQuery for analysis. In certain circumstances the user can delete their session data which then deletes the doc from Firestore. I want to stop the extension then removing the data from BigQuery. How can I do this? Is it a permissions thing or is there a way of configuring the extension not to delete the datas?


Solution

  • Is it a permissions thing or is there a way of configuring the extension not to delete the datas?

    No it is not a matter of permission nor a matter of configuration (see here all the parameters that you can configure). You simply cannot configure the extension to have this "behaviour".

    I can see two possible solutions:

    1. Adapt the code of the BigQuery view which represents the current state of the data within your collection:

    As explained in the Extension doc:

    The extension creates and updates a dataset containing the following two BigQuery resources:

    1/ A table of raw data that stores a full change history of the documents within your collection. This table includes a number of metadata fields so that BigQuery can display the current state of your data. The principle metadata fields are timestamp, document_name, and the operation for the document change.

    2/ A view which represents the current state of the data within your collection. It also shows a log of the latest operation for each document (CREATE, UPDATE, or IMPORT).

    In the BigQuery console (in the Google Cloud console) you'll find the code of the view: Select the view (it has the _raw_latest suffix) and click on the "Detail" tab: You'll find the view SQL code which is as follows:

    -- Retrieves the latest document change events for all live documents.
    --   timestamp: The Firestore timestamp at which the event took place.
    --   operation: One of INSERT, UPDATE, DELETE, IMPORT.
    --   event_id: The id of the event that triggered the cloud function mirrored the event.
    --   data: A raw JSON payload of the current state of the document.
    --   document_id: The document id as defined in the Firestore database
    SELECT
      document_name,
      document_id,
      timestamp,
      event_id,
      operation,
      data,
      old_data
    FROM
      (
        SELECT
          document_name,
          document_id,
          FIRST_VALUE(timestamp) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS timestamp,
          FIRST_VALUE(event_id) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS event_id,
          FIRST_VALUE(operation) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS operation,
          FIRST_VALUE(data) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS data,
          FIRST_VALUE(old_data) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) AS old_data,
          FIRST_VALUE(operation) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) = "DELETE" AS is_deleted
        FROM
          `...`
        ORDER BY
          document_name,
          timestamp DESC
      )
    WHERE
      NOT is_deleted
    GROUP BY
      document_name,
      document_id,
      timestamp,
      event_id,
      operation,
      data,
      old_data
    

    Remove the following WHERE clause

    WHERE
      NOT is_deleted
    

    and the following statement

          FIRST_VALUE(operation) OVER(
            PARTITION BY document_name
            ORDER BY
              timestamp DESC
          ) = "DELETE" AS is_deleted
    

    and it should do the trick.

    Note that the data of the deleted doc will be in the old_data column and not in the data one. Also note that you can very well create a new view based on the adapted SQL code and let the initial view (created by the extension) untouched.

    2. Use an intermediary Collection

    Instead of configuring the extension on your "actual" collection, just create another collection and mirror, in this new collection, all the document creations and updates but not the deletions.

    To do the mirroring you use a Cloud Function that copies any new document from the main actual collection to the mirrored collection, as well as update a mirrored document each time its corresponding doc changes in the main actual collection.


    A third approach, much more complex, would be to adapt the code of the extension: The code is open source and available here. Adapt it to avoid deleting the Big Query records when a Firestore doc is deleted and deploy it as your own Cloud Function.

    I would advise going with the first solution.