celery, google-bigquery, pipeline, producer-consumer

Batch processing from N producers to 1 consumer


I have built a distributed (Celery-based) parser that deals with about 30K files per day. Each file (an EDI file) is parsed into JSON and saved to a file. The goal is to populate a BigQuery dataset.

The generated JSON is BigQuery schema compliant and can be loaded as-is into our dataset (table). But we are limited to 1,000 load jobs per day, and the incoming messages must be loaded into BQ as fast as possible.

So the target is: every message is parsed by a Celery task, and each result is buffered in a 300-item (distributed) buffer. When the buffer reaches that limit, all the JSON data is aggregated and pushed into BigQuery.

I found Celery Batches; the need is for a production environment, but it is the closest out-of-the-box solution I have found (roughly the flow sketched below).
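Roughly, the flow I have in mind with celery-batches would look like this (a sketch only; the broker URL, task names, and the `load_batch_to_bq` helper are placeholders, not working code):

```python
from celery import Celery
from celery_batches import Batches

app = Celery("edi_pipeline", broker="amqp://rabbitmq//")

def load_batch_to_bq(rows):
    """Placeholder: aggregate the rows and submit a single BigQuery load job."""
    ...

@app.task(base=Batches, flush_every=300, flush_interval=60)
def buffer_result(requests):
    # celery-batches hands over the accumulated task calls once 300 have arrived
    # (or flush_interval seconds have passed); each call carries one JSON row.
    rows = [request.args[0] for request in requests]
    load_batch_to_bq(rows)
```

Each parse task would then call something like `buffer_result.delay(json_row)` instead of writing the JSON to a file.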

Note: RabbitMQ is the message broker and the application is shipped with Docker.

Thanks,


Solution

  • Use streaming inserts; your limit there is 100,000 rows per second (see the sketch after the link).

    https://cloud.google.com/bigquery/streaming-data-into-bigquery
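A minimal sketch of a streaming insert with the google-cloud-bigquery Python client; the project/dataset/table name and the row payload are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# One dict per row, matching the destination table schema (field names are illustrative).
rows = [
    {"message_id": "abc-123", "status": "PARSED"},
]

# insert_rows_json streams the rows into the table, bypassing the load-job quota.
errors = client.insert_rows_json("my_project.my_dataset.my_table", rows)
if errors:
    print(f"Rows that failed to insert: {errors}")
```

With streaming inserts the 300-item buffer is no longer required for quota reasons; each Celery task can push its row (or a small batch of rows) as soon as it is parsed.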