I have a script which downloads a lot of JSON files. After they are downloaded, I process each one and pass it to some other functions. Presently, I just wait until all the JSONs are downloaded and then process each of them. Is there any way to do this in parallel, i.e. as soon as each JSON is downloaded, move on to performing some tasks on it?
I am thinking of using RabbitMQ, which would send the consumer the path of a JSON after it has been fully downloaded. I have no clue how to determine whether a JSON has been downloaded and is ready to use.
I have looked at other answers, but I couldn't find anything clear. I just want an idea of how to approach the concurrency part, or how to hand the just-downloaded JSON to the next process.
Using some sort of message queue would cleanly address this problem and decouple downloading a JSON from processing it.
In this setup:
[download] -> [MQ] -> [process] -> ??
each [] represents a separate process and each -> represents sending some sort of interprocess data.
Your download script could be modified to save each file to a cloud file storage service and to publish a message with that file's location once the download is complete.
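For example, here is a minimal sketch of the publisher side, assuming RabbitMQ with the pika client; the queue name `json_files`, the URL list, and the `download_json()` helper are all placeholders for your real setup:

```python
import json
import urllib.request

import pika

def download_json(url: str, dest_path: str) -> None:
    # Hypothetical helper: fetch the JSON and write it out in full.
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as f:
        f.write(resp.read())

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="json_files", durable=True)

urls = ["https://example.com/a.json", "https://example.com/b.json"]  # placeholder
for i, url in enumerate(urls):
    path = f"/tmp/file_{i}.json"
    download_json(url, path)
    # Publish the file's location only after the download has fully completed,
    # so the consumer never sees a partially written file.
    channel.basic_publish(
        exchange="",
        routing_key="json_files",
        body=json.dumps({"path": path}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

connection.close()
```

Publishing only after the write finishes is also the answer to "how do I know a JSON is ready?": the message itself is the completion signal.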
There could then be a consumer process which reads from the message queue and processes each file.
This will allow you to process each file as it is downloaded. Additionally, it lets you scale the download and processing steps independently.
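A matching sketch of the consumer side, under the same assumptions; `process_file()` stands in for whatever your processing functions actually do:

```python
import json

import pika

def process_file(path: str) -> None:
    # Hypothetical processing step: load the JSON and hand it to other functions.
    with open(path) as f:
        data = json.load(f)
    print(f"processed {path}")

def on_message(channel, method, properties, body):
    path = json.loads(body)["path"]
    process_file(path)
    # Ack only after processing succeeds, so failed work gets redelivered.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="json_files", durable=True)
channel.basic_consume(queue="json_files", on_message_callback=on_message)
channel.start_consuming()
```

Because the ack happens after processing, a consumer that crashes mid-file will not lose that file; RabbitMQ redelivers the unacked message.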
While this pattern is very common, it comes with operational complexity: you will need to manage three separate processes.
If you would like to run this on a single machine, you could apply the same pattern locally with two separate processes linked by an OS pipe:

download.py | process_json.py

Here download.py would download each file and write its path to stdout, and process_json.py would read one file path at a time from stdin and operate on it. See the sketch below.
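A minimal sketch of the two scripts, assuming a simple line protocol of one file path per line; the URL list and processing step are placeholders:

```python
# download.py -- downloads each JSON and prints its path when complete.
import urllib.request

urls = ["https://example.com/a.json", "https://example.com/b.json"]  # placeholder

for i, url in enumerate(urls):
    path = f"/tmp/file_{i}.json"
    with urllib.request.urlopen(url) as resp, open(path, "wb") as f:
        f.write(resp.read())
    # Print (and flush) only once the file is fully written, so the
    # downstream process never reads a partial file.
    print(path, flush=True)
```

```python
# process_json.py -- reads one file path per line from stdin and processes it.
import json
import sys

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    with open(path) as f:
        data = json.load(f)
    # Replace this with your real processing functions.
    print(f"processed {path}", file=sys.stderr)
```

The flush=True matters: when stdout is piped it is block-buffered by default, so without flushing, paths would reach process_json.py in bursts rather than as each download finishes.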