I have growing data in GCS and a batch job that will run, let's say, every day to process an increment of 1 million articles. For each key I need to fetch additional information from Bigtable (which contains billions of records). Is it feasible to simply do a lookup for every item in a map operation? Does it make sense to batch those lookups and perform something like a bulk read? Or what is the best way to handle this use case with Scio/Beam?
I found in "Pattern: Streaming mode large lookup tables" that performing a lookup for every request is the recommended approach for streaming; however, I'm not sure whether the batch job would overload Bigtable.
Do you have any general or concrete recommendations for how to handle this use case?
I've helped others do this before, but with plain Dataflow/Beam. You'll need to aggregate the keys into batches for optimal performance; somewhere between 25 and 100 keys per batch would make sense. If you can, pre-sort the keys so that each batch is more likely to hit fewer Cloud Bigtable nodes per request. A rough sketch of this batching is below.
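For illustration only, here's a minimal Scio sketch of that batching, assuming the HBase-compatible Bigtable client (BigtableConfiguration from bigtable-hbase). The project/instance/table ids, the one-key-per-line input format, and the "first cell value" extraction are placeholders you'd adapt to your schema:

```scala
import com.spotify.scio.ContextAndArgs
import com.google.cloud.bigtable.hbase.BigtableConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Get, Result}
import org.apache.hadoop.hbase.util.Bytes
import scala.jdk.CollectionConverters._

object EnrichArticles {
  // Placeholder ids -- substitute your own project, instance and table.
  private val ProjectId  = "my-project"
  private val InstanceId = "my-instance"
  private val TableId    = "article-info"

  // One HBase-compatible connection per worker JVM, lazily created and reused.
  private lazy val connection: Connection =
    BigtableConfiguration.connect(ProjectId, InstanceId)

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))           // one article key per line (placeholder format)
      .groupBy(_.hashCode % 500)         // spread keys into manageable groups
      .flatMap { case (_, keys) =>
        val table = connection.getTable(TableName.valueOf(TableId))
        // Sort so a batch tends to touch fewer tablets, then issue multi-Gets of ~100 keys.
        keys.toSeq.sorted.grouped(100).flatMap { batch =>
          val gets: java.util.List[Get] = batch.map(k => new Get(Bytes.toBytes(k))).asJava
          val results: Array[Result] = table.get(gets)
          // Pair each key with the value of its first cell; empty string if the row is missing.
          batch.zip(results.map(r => Option(r.value()).map(b => Bytes.toString(b)).getOrElse("")))
        }
      }
      .map { case (key, value) => s"$key\t$value" }
      .saveAsTextFile(args("output"))

    sc.run().waitUntilFinish()
  }
}
```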
You can use the Cloud Bigtable client directly; just make sure to enable the "use bulk" setting, or keep a singleton that caches the client. See the sketch below.
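As a sketch of the "singleton client" idea, here's a DoFn that creates one client in @Setup and reuses it across bundles, reading each batch of keys with a single multi-row-key Query. Note this uses the newer java-bigtable data client (BigtableDataClient) rather than the HBase "use bulk" setting; the class name and the way the row is rendered are illustrative only:

```scala
import com.google.cloud.bigtable.data.v2.BigtableDataClient
import com.google.cloud.bigtable.data.v2.models.Query
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.{ProcessElement, Setup, Teardown}

/** Looks up one batch of row keys per element with a single ReadRows call.
  * Project/instance/table ids are constructor parameters; names are illustrative.
  */
class BulkLookupFn(projectId: String, instanceId: String, tableId: String)
    extends DoFn[Seq[String], (String, String)] {

  // Created once per DoFn instance on the worker and reused for every bundle,
  // i.e. the "cache the client" part.
  @transient private var client: BigtableDataClient = _

  @Setup
  def setup(): Unit =
    client = BigtableDataClient.create(projectId, instanceId)

  @ProcessElement
  def processElement(c: DoFn[Seq[String], (String, String)]#ProcessContext): Unit = {
    // All keys of the batch go into one Query, which becomes one ReadRows RPC
    // instead of one RPC per key.
    val query = c.element().foldLeft(Query.create(tableId))((q, k) => q.rowKey(k))
    val rows = client.readRows(query).iterator()
    while (rows.hasNext) {
      val row = rows.next()
      // Emit the row key plus a compact rendering of its cells (adapt to your schema).
      c.output((row.getKey.toStringUtf8, row.getCells.toString))
    }
  }

  @Teardown
  def teardown(): Unit =
    if (client != null) client.close()
}
```

With Scio you could apply it to the batched keys via something like applyTransform(ParDo.of(new BulkLookupFn(project, instance, table))).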
This will definitely have an impact on your Cloud Bigtable cluster, but I couldn't tell you how much. You may need to increase the size of your cluster so that other uses of Cloud Bigtable don't suffer.