I am new to data ingestion but have worked through some examples of using Google Cloud Dataflow in both batch and streaming mode. I am now ready to build the actual project, but I need to choose one of several architectures to achieve my goal.
The goal: call an API to extract data, which then needs to be processed and loaded into a BigQuery table.
Here we go into the finer details:
It is the Easylog Cloud API. I can successfully make calls to it using the requests library in Python. The API covers multiple locations, each with multiple devices, and each device samples a reading every 60 seconds. The number of devices may increase over time. Assume that, on every call to the API, I am able to filter out readings that have already been seen. I will most likely be calling the API every 2 minutes (the team still needs to decide). A rough sketch of the polling code is shown below.
My function that calls the API already does most of the data processing and transformations, so I do not actually need to do any transforms in, say, an Apache Beam pipeline.
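For context, the polling function looks roughly like the sketch below. The base URL, auth scheme, and response shape are placeholders rather than the real Easylog Cloud API contract, and the deduplication of already-seen readings is assumed to happen as described above.

```python
import requests

# Rough shape of the polling function. The base URL, auth scheme, and
# response fields are placeholders, not the real Easylog Cloud API contract.
EASYLOG_BASE_URL = "https://api.easylog.example.com/v1"  # hypothetical
API_KEY = "YOUR_API_KEY"


def fetch_new_readings():
    """Pull the latest readings across all locations and devices."""
    resp = requests.get(
        f"{EASYLOG_BASE_URL}/readings",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    readings = resp.json()
    # Filtering out readings that have already been seen (and the rest of the
    # processing/transformation) is assumed to happen here.
    return readings
```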
The data needs to be written into a BigQuery table, either via streaming inserts or via batch loads. I assume a batch load is cheaper; the answer by Pravin Dhinwa on this question suggests that batch load jobs are free.
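To make the cost difference concrete, here is a minimal sketch of both ingestion paths with the google-cloud-bigquery client; the table name and row schema are assumptions. Load jobs themselves are not billed per byte ingested (you pay for storage and queries), while the legacy streaming insert API is billed on data volume.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.device_readings"  # hypothetical table

rows = [
    {"device_id": "dev-001", "location": "site-a",
     "timestamp": "2024-01-01T00:00:00Z", "value": 21.4},
]

# Option A: batch load job. Load jobs run on the free shared slot pool;
# you only pay for storage and for querying the data afterwards.
load_job = client.load_table_from_json(rows, table_id)
load_job.result()  # block until the load job completes

# Option B: streaming insert. Rows become queryable almost immediately,
# but the streaming insert API is billed per volume ingested.
errors = client.insert_rows_json(table_id, rows)
if errors:
    raise RuntimeError(f"Streaming insert errors: {errors}")
```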
Here I discuss some possible solutions. Keep in mind that they might be rubbish. The goal of this question is to identify whether one of these is good, or to suggest a better one.
Solution 1: Use a Google Cloud Function to pull from the API and write/insert directly into the BigQuery table. This would be a function scheduled periodically with Cloud Scheduler, and it might also write some temporary data to Cloud Storage buckets. This approach seems simple, but because of that simplicity I question its robustness and whether it could become costly.
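A minimal sketch of Solution 1, assuming a 2nd-gen Cloud Function with functions-framework; the table name and the easylog_client module (wrapping the polling sketch above) are hypothetical:

```python
import functions_framework
from google.cloud import bigquery

from easylog_client import fetch_new_readings  # hypothetical module with the earlier sketch

TABLE_ID = "my-project.my_dataset.device_readings"  # hypothetical table


@functions_framework.http
def ingest(request):
    """HTTP entry point, invoked on a schedule by Cloud Scheduler."""
    rows = fetch_new_readings()
    if rows:
        client = bigquery.Client()
        # Batch load job rather than streaming insert, to keep ingestion free.
        client.load_table_from_json(rows, TABLE_ID).result()
    return (f"loaded {len(rows)} rows", 200)
```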
Solution 2: Deploy a Google Cloud Dataflow pipeline built with Apache Beam. The pipeline has a ParDo function that pulls from the API, then inserts into the BigQuery table and writes some temporary data to Cloud Storage buckets. The pipeline is triggered by periodic Pub/Sub messages, which are published by a scheduled Cloud Scheduler job. I feel this approach keeps a single pipeline running, so it does not need to be set up again for every run.
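A sketch of what Solution 2 could look like. The Pub/Sub topic, table name, and fetch_new_readings helper are assumptions, not working Easylog code:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class FetchReadings(beam.DoFn):
    """Each Pub/Sub message from Cloud Scheduler acts as a 'tick' that triggers one API pull."""

    def process(self, trigger_message):
        # The message content is ignored; it only signals "pull now".
        from easylog_client import fetch_new_readings  # hypothetical module
        for reading in fetch_new_readings():
            yield reading


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "ReadTick" >> beam.io.ReadFromPubSub(
               topic="projects/my-project/topics/ingest-tick")
         | "PullFromAPI" >> beam.ParDo(FetchReadings())
         # Note: in streaming mode WriteToBigQuery defaults to streaming inserts;
         # method=FILE_LOADS with a triggering_frequency would use batch loads instead.
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.device_readings",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == "__main__":
    run()
```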
Solution 3: This is almost identical to Solution 2, except that I do not pull from the API with a ParDo function inside the pipeline. Instead, I schedule a Cloud Function to run periodically with Cloud Scheduler. This function pulls from the API and publishes the results to a Pub/Sub topic. Those messages feed the pipeline, which writes to the BigQuery table and the buckets. This approach does seem like an overkill version of Solution 1, though, since it adds an intermediate pipeline where Solution 1 does everything in the Cloud Function.
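In Solution 3 the scheduled function would only publish, roughly as in the sketch below (topic name, message shape, and the easylog_client module are assumptions); the Beam pipeline from Solution 2 would then consume these messages instead of calling the API itself.

```python
import json

import functions_framework
from google.cloud import pubsub_v1

from easylog_client import fetch_new_readings  # hypothetical module

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "device-readings")  # hypothetical topic


@functions_framework.http
def pull_and_publish(request):
    """Scheduled by Cloud Scheduler: pull from the API, publish each reading to Pub/Sub."""
    readings = fetch_new_readings()
    for reading in readings:
        publisher.publish(topic_path, data=json.dumps(reading).encode("utf-8"))
    return (f"published {len(readings)} readings", 200)
```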
I am looking for a solution that fulfills the following requirements, in descending order of importance:
I would recommend Solution 2, since Dataflow is robust for a streaming pipeline like yours, and once the pipeline is working you do not need to maintain it.