
Process 10 req/s and save to cloud storage - recommended method?


I have 10 requests per second, each carrying data that looks like the entry below, which I need to save after a Cloud Run function completes. (My infrastructure is on google-cloud-platform.) The data will be used as a dataset for machine learning.

{ 
  "text": "1k characters", 
  "text2": "1k characters", 
  "metadata1": "enum (100 vals)", 
  "metadata2": "number value" 
}

I planned to save this with a non-awaited (fire-and-forget) call to google-cloud-storage, either all in one folder or in folders keyed by the metadata1 enum. Is either layout better than the other?
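
For reference, this is roughly what I mean by the non-awaited write (a minimal sketch using google-cloud-storage; the bucket name and per-object layout are placeholders, not my real config):

```python
import json
import uuid

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-training-data")  # placeholder bucket name


def save_entry(entry: dict) -> None:
    """Write one request's payload as its own JSON object, prefixed by the metadata1 enum."""
    blob_name = f"{entry['metadata1']}/{uuid.uuid4()}.json"
    bucket.blob(blob_name).upload_from_string(
        json.dumps(entry), content_type="application/json"
    )
```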

Is this the appropriate route to take?

I think Pub/Sub is overkill, as suggested in this SO answer.


Solution

  • I can propose two patterns, but in both cases you need to store the messages:

    • Either use Pub/Sub to stack the messages. Then use Dataflow to read Pub/Sub and sink to Cloud Storage, or use an on-demand service (Cloud Run, for example) to pull your Pub/Sub subscription and write one file with all the messages read (you can trigger your Cloud Run with Cloud Scheduler, every hour for example; see the first sketch after this list).
    • Or store the messages in BigQuery, and then export query results to GCS regularly (again with Cloud Scheduler + Cloud Functions/Run; see the second sketch after this list). It's my preferred solution, because maybe one day you will have to process your messages differently, or get metrics/perform analytics on them.
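
A minimal sketch of the second variant of the first pattern: a scheduled Cloud Run/Functions job that drains the Pub/Sub subscription and writes one file per run to Cloud Storage. Project, subscription, bucket and object names are placeholders.

```python
from google.cloud import pubsub_v1, storage

PROJECT = "my-project"          # placeholder
SUBSCRIPTION = "entries-sub"    # placeholder
BUCKET = "my-training-data"     # placeholder


def drain_to_gcs(run_id: str) -> None:
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)

    # Pull a batch of messages (a real job would loop until the subscription is empty).
    response = subscriber.pull(
        request={"subscription": sub_path, "max_messages": 1000}
    )
    if not response.received_messages:
        return

    # Write all pulled messages as one newline-delimited JSON file.
    lines = [msg.message.data.decode("utf-8") for msg in response.received_messages]
    storage.Client().bucket(BUCKET).blob(f"batches/{run_id}.jsonl").upload_from_string(
        "\n".join(lines), content_type="application/json"
    )

    # Ack only after the file is safely written, so nothing is lost on failure.
    ack_ids = [msg.ack_id for msg in response.received_messages]
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})
```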
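
And a minimal sketch of the second pattern: insert each entry into BigQuery from the Cloud Run request handler, then export to GCS on a schedule. The dataset, table and bucket names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my-project.ml_dataset.entries"  # placeholder dataset.table


def insert_entry(entry: dict) -> None:
    # Streaming insert, called from the Cloud Run request handler.
    errors = client.insert_rows_json(TABLE, [entry])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")


def export_to_gcs(run_id: str) -> None:
    # Called by the Cloud Scheduler-triggered Cloud Functions/Run job.
    client.query(
        f"""
        EXPORT DATA OPTIONS(
          uri='gs://my-training-data/exports/{run_id}/*.json',
          format='JSON',
          overwrite=true)
        AS SELECT * FROM `{TABLE}`
        """
    ).result()
```

Keeping the data in BigQuery first also means the later export (partitioned by date, filtered by metadata1, etc.) is just a SQL change rather than a pipeline change.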