I've set up a sink to transfer logs from Google Cloud Logging to BigQuery. Everything works fine, but sometimes there are duplicate rows in BigQuery for the same log entry in Cloud Logging.
Here is an example from Cloud Logging. There is only one log entry here.
And here's what I get when I query this record from BigQuery with insertId 1fw0b92g26o229x:
Has anybody had the same issue, and how can I prevent this duplication?
Thanks
Duplicates can occur when there are failures in streaming logs to BigQuery, or anywhere upstream, including on the client side. BigQuery currently does not de-duplicate the data. If the receiveTimestamp values are all the same, the duplicates occurred somewhere in the logging pipeline or inside BigQuery streaming ingestion. There is currently no way to achieve perfect de-duplication at ingest time, so the duplicates need to be removed at query time.
You don't see the duplicates in the Logs Viewer because it de-duplicates logs with the same timestamp and insertId at query time. You can do the same in BigQuery by querying with GROUP BY, as in the example below.
SELECT timestamp, severity, insertId
FROM `project-id.my_dataset.my_table`
GROUP BY timestamp, severity, insertId
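
If you need the full rows rather than just a few grouped columns, a window function can keep a single row per insertId and timestamp. This is a sketch assuming the same hypothetical table name as above:

-- Keep one full row per (insertId, timestamp) pair.
-- Which duplicate survives is arbitrary, but since the duplicates
-- are identical copies of the same log entry, it doesn't matter.
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY insertId, timestamp) AS row_num
  FROM `project-id.my_dataset.my_table`
)
WHERE row_num = 1

You could also materialize the result of such a query into a cleaned table on a schedule if downstream consumers shouldn't have to de-duplicate themselves.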