I am working on a project that periodically checks the websites users register (for example, every 3 minutes). I built the project from two separate services. One is the service that communicates with the user (I developed it with NodeJS). The other is a Python service that performs the background checks and writes the necessary status notifications to the database.
I think Python is strong in this area thanks to its libraries. The issue I'm stuck on is this: suppose users register a large number of addresses over time. In that case, how can I check them all at the same time?
The first thing that came to my mind was threads, and I created the following solution:
import threading

# `websites` holds the rows read from the database; `check` performs the actual request.
threads = []
for website in websites:
    thread = threading.Thread(target=check, args=(website[0], website[1], website[2]))
    thread.start()
    threads.append(thread)

# Wait for every check to finish before the next polling cycle.
for thread in threads:
    thread.join()
This code checks the 200 URL addresses (test data) registered in the database. It uses threads to check them at the same time, but since it creates one thread per URL, I think it works against the intended use of threads. In short, I believe there should be a healthier approach than this one. Which path should I follow in this case? What methods do I need to look into and work on?
Thanks.
In my view, a more scalable solution would involve implementing a task queue, considering the inherent limitations of a single computer's processing capacity. For example, if each instance of the service can handle up to 200 URLs simultaneously, and we have a total of 800 URLs to check, we can segment these URLs into batches of 200 and leverage a tool like Celery to distribute the workload efficiently.
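A minimal sketch of that idea, assuming a Redis broker; the names check_batch, check_one, and enqueue_checks are illustrative and not from the original project:

# tasks.py -- minimal Celery sketch; the broker URL and all names here are assumptions
import requests
from celery import Celery

app = Celery("monitor", broker="redis://localhost:6379/0")

def check_one(url):
    # Stand-in for the existing per-URL check that writes a status to the database.
    try:
        return requests.get(url, timeout=10).status_code < 500
    except requests.RequestException:
        return False

@app.task
def check_batch(urls):
    # Each worker processes one batch of URLs and returns their statuses.
    return {url: check_one(url) for url in urls}

def enqueue_checks(all_urls, batch_size=200):
    # Slice the full URL list into batches of 200 and queue one task per batch.
    for i in range(0, len(all_urls), batch_size):
        check_batch.delay(all_urls[i:i + batch_size])

The scheduler side only slices the list and enqueues tasks; adding more Celery workers is what lets the system grow past what a single machine can handle.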
Given the nature of the network operations involved, each worker should also apply parallelism internally, using threads or coroutines, to maximize throughput.
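For the in-worker parallelism, a bounded thread pool is one option: it caps the number of live threads regardless of batch size. This sketch reuses the hypothetical check_one helper from above:

from concurrent.futures import ThreadPoolExecutor

def check_batch_pooled(urls, max_workers=50):
    # A fixed-size pool limits how many threads run at once, however large the batch is.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check_one, urls))
    return dict(zip(urls, results))

A coroutine-based variant achieves the same effect with a single event loop instead of threads; a sketch using asyncio and aiohttp:

import asyncio
import aiohttp

async def check_batch_async(urls):
    # One coroutine per URL, all multiplexed on a single event loop.
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=10)) as session:
        async def check(url):
            try:
                async with session.get(url) as resp:
                    return resp.status < 500
            except (aiohttp.ClientError, asyncio.TimeoutError):
                return False
        results = await asyncio.gather(*(check(url) for url in urls))
    return dict(zip(urls, results))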