I am building an application which takes in requests from users via web interface and then performs some processing and returns the result whenever available.
Here is a simple overview of the current architecture:
The web application adds the request to one of several MongoDB collections with a processed field set to False. Each collection has its own processing servers, which poll the collection for unprocessed entries. When a server finds one, it performs the processing, which takes some time and incurs some cost (external API calls), then saves the result back in the database (in output_data) and sets processed to True.
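For reference, the polling worker described above looks roughly like this. This is a minimal sketch: the collection name "requests" and the handler function are placeholder assumptions, but the processed and output_data fields match my schema.

```python
# Sketch of the current polling worker. The "requests" collection name
# and the handler are illustrative assumptions; processed / output_data
# are the actual fields used.
import time

def poll_once(collection, handler):
    """Find one unprocessed entry, process it, and store the result."""
    doc = collection.find_one({"processed": False})
    if doc is None:
        return False
    # NOTE: with two servers running, both can find_one() the same
    # document before either marks it processed -- so the same entry
    # can be processed twice and incur the external-API cost twice.
    output = handler(doc)
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"processed": True, "output_data": output}},
    )
    return True

if __name__ == "__main__":
    from pymongo import MongoClient  # requires a running MongoDB
    coll = MongoClient()["appdb"]["requests"]
    while True:
        if not poll_once(coll, lambda doc: {"echo": doc.get("input_data")}):
            time.sleep(5)  # nothing to do; wait before polling again
```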
Now, the problem I have:
I am unable to scale up the processing servers for each module: if I run two servers, there is a chance that the same entry is processed twice, which would incur extra cost for me.
I also want to decouple the processing servers from the database, so that I can reuse the same processing servers with different databases (e.g. for different customers).
I do not know much about queues and pub/sub architectures. I think some sort of queue architecture would help achieve the above, but I am not sure how to handle duplicate messages.
Please let me know what architecture would help avoid the problems above. I would prefer a cloud-provider-agnostic solution, but if really needed I would go with AWS.
Update: My current development stack is Python, Flask, MongoDB, Docker.
I would suggest using a message queue, which will solve many of your problems. RabbitMQ is one example, and Python client libraries (such as pika) are available for working with it.
Instead of polling, your worker processes simply wait for new messages to arrive; the broker hands each message to one worker at a time, which avoids the duplicate-processing problem. Workers can also publish their results back to the queue, and a separate saving worker can write them to any (different) database. Introducing a message queue fits your architecture well as a publish/subscribe pattern.
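Here is a rough sketch of such a worker using pika. It is a sketch under stated assumptions: the queue name "tasks", the task fields, and the processing step are placeholders, not part of your current schema. The key points are the durable queue, prefetch_count=1 so the broker gives each worker one message at a time, and acknowledging only after the work is done so a crashed worker's message is redelivered.

```python
# Hedged sketch of a RabbitMQ work queue with the pika client.
# The queue name "tasks" and task fields are illustrative assumptions.
import json

def encode_task(request_id, input_data):
    """Serialize a task so the web app can publish it to the queue."""
    return json.dumps({"request_id": request_id,
                       "input_data": input_data}).encode()

def decode_task(body):
    """Deserialize a task body received from the queue."""
    return json.loads(body.decode())

if __name__ == "__main__":
    import pika  # requires a running RabbitMQ broker

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tasks", durable=True)  # survive broker restarts

    def on_message(ch, method, properties, body):
        task = decode_task(body)
        result = {"echo": task["input_data"]}  # your real processing here
        # ... save `result` to whichever database this deployment uses ...
        # Ack only after the work is done: an unacked message is redelivered
        # if this worker dies, so no task is silently lost.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    # prefetch_count=1: the broker hands each worker one message at a time,
    # so adding more workers spreads load without duplicate delivery.
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue="tasks", on_message_callback=on_message)
    channel.start_consuming()
```

The web app would call basic_publish with encode_task(...) instead of writing a processed=False document, and you can run as many copies of this worker as you need.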