amazon-web-services, google-bigquery, time-series, dashboard, aws-documentdb

Transfer/Replicate Data Periodically from AWS DocumentDB to Google Cloud BigQuery


We are building a customer-facing app. Data for this app is captured by IoT devices owned by a third party and transferred to us from their server via API calls. We store this data in our AWS DocumentDB cluster, and the user app is connected to this cluster with real-time data feed requirements. Note: the data is time-series data.

The thing is, for long-term data storage and for building analytics dashboards to share with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the AWS DocumentDB cluster to BigQuery on their Google Cloud Platform project. We could then run queries directly on BigQuery for analysis and feed the results to a tool such as Explorer or Tableau to create dashboards.

I couldn't find any straightforward solution for this. Any ideas, comments, or suggestions are welcome. How do I achieve or plan the above replication, and how do I make sure the data is copied efficiently in terms of memory and cost? I also don't want to disturb the performance of the AWS DocumentDB cluster, since it supports our user-facing app.


Solution

  • This would need some custom implementation. You can use Change Streams to capture data changes and process them at intervals to send to BigQuery, giving you a replication mechanism in place for running analytics. One documented use case for Change Streams is analytics with Amazon Redshift, so BigQuery should serve a similar purpose; a minimal consumer sketch follows below.

    Using Change Streams with Amazon DocumentDB: https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html

    This document also contains sample Python code for consuming change stream events.
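
    For illustration, here is a minimal sketch of such a consumer in Python, assuming the pymongo and google-cloud-bigquery client libraries. The connection string, database/collection names (iot.readings), and BigQuery table (my-project.analytics.readings) are placeholders, and the sketch assumes the BigQuery table already exists with a schema matching your documents:

        import pymongo
        from google.cloud import bigquery

        # Placeholders: substitute your own cluster endpoint, credentials, and names.
        DOCDB_URI = "mongodb://user:password@docdb-cluster:27017/?tls=true&replicaSet=rs0"
        BQ_TABLE = "my-project.analytics.readings"  # project.dataset.table
        BATCH_SIZE = 500

        # Change streams must first be enabled on the collection, e.g. from the mongo shell:
        #   db.adminCommand({modifyChangeStreams: 1, database: "iot",
        #                    collection: "readings", enable: true});

        client = pymongo.MongoClient(DOCDB_URI)
        collection = client["iot"]["readings"]
        bq = bigquery.Client()

        def flush(rows):
            """Stream a batch of rows into BigQuery; surface any insert errors."""
            if rows:
                errors = bq.insert_rows_json(BQ_TABLE, rows)
                if errors:
                    raise RuntimeError(f"BigQuery insert errors: {errors}")

        rows = []
        # full_document='updateLookup' returns the whole document on updates,
        # not just the changed fields, so each event maps to one BigQuery row.
        with collection.watch(full_document="updateLookup") as stream:
            for change in stream:
                doc = change.get("fullDocument")
                if doc is None:  # deletes carry no fullDocument; handle separately if needed
                    continue
                doc["_id"] = str(doc["_id"])  # ObjectId is not JSON-serializable
                rows.append(doc)                # (datetime fields may need similar conversion)
                if len(rows) >= BATCH_SIZE:
                    flush(rows)
                    # stream.resume_token could be persisted here so a scheduled
                    # (e.g. daily) run can pick up where the last one left off.
                    rows = []

    Running the consumer on a schedule rather than continuously, and persisting the resume token between runs, keeps the read load on the DocumentDB cluster low; batching the inserts keeps BigQuery streaming costs down.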