asynchronous architecture microservices jobs system-design

Crawler design - calling an async job vs. calling a service


I'm looking at Donne Martin's design for a web crawler. The crawler service processes a newly crawled URL, and then:

  • Adds a job to the Reverse Index Service queue to generate a reverse index
  • Adds a job to the Document Service queue to generate a static title and snippet

What would happen if the crawler service called these two services synchronously instead? I would still be able to horizontally scale all three services according to the load on each, right? What came to me as a possible reason is just more complex flow control if one of them fails. Are there other, more compelling reasons for these async jobs?


Solution

  • What would happen if the crawler service called these two services synchronously instead?

    First, the slowest service becomes a bottleneck for the crawler. A synchronous call means the crawler has to wait until the service has finished processing the request. With a queue, the crawler only enqueues the job and moves straight on to the next link, without waiting for the other services. You could argue that the crawler could keep its own internal queue, though.
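To make the bottleneck concrete, here is a minimal sketch, assuming each downstream request takes about 10 ms. The names `call_service`, `crawl_sync`, `crawl_async` and the `SERVICE_DELAY` value are made up for illustration, and in-process `queue.Queue` objects stand in for a real message broker.

```python
import queue
import threading
import time

SERVICE_DELAY = 0.01  # assumed per-request processing time, in seconds


def call_service(url):
    """Stand-in for the Reverse Index or Document service (hypothetical)."""
    time.sleep(SERVICE_DELAY)


def crawl_sync(urls):
    """Crawler waits for both services on every URL; returns elapsed seconds."""
    start = time.perf_counter()
    for url in urls:
        call_service(url)  # reverse index
        call_service(url)  # title and snippet
    return time.perf_counter() - start


def consume(jobs, done):
    """Worker loop: take jobs off a queue until a None sentinel arrives."""
    while True:
        url = jobs.get()
        if url is None:
            return
        call_service(url)
        done.append(url)


def crawl_async(urls):
    """Crawler only enqueues; returns (time spent enqueueing, results)."""
    index_q, doc_q = queue.Queue(), queue.Queue()
    indexed, documented = [], []
    workers = [
        threading.Thread(target=consume, args=(index_q, indexed)),
        threading.Thread(target=consume, args=(doc_q, documented)),
    ]
    for w in workers:
        w.start()

    start = time.perf_counter()
    for url in urls:
        index_q.put(url)
        doc_q.put(url)
    crawler_time = time.perf_counter() - start  # crawler is free again here

    # Shut the workers down once everything queued has been processed.
    index_q.put(None)
    doc_q.put(None)
    for w in workers:
        w.join()
    return crawler_time, indexed, documented
```

In the synchronous version the crawler's time per URL is the sum of both service calls. In the queued version the crawler's own loop finishes almost immediately, and the services catch up at their own pace.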

    Second, durability. Maybe it doesn't matter much if one or a few links are lost when a service goes down and can't process a request from the crawler. But queues can be durable: they persist their state to disk and resume from the point where they stopped. That becomes very useful if all services go down at the same time, when many links would otherwise be lost.
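As a rough illustration of what "durable" means here, the following sketch keeps jobs in a SQLite file, so they survive a restart. `DurableQueue` and `demo_restart` are hypothetical names, and a real deployment would more likely rely on a message broker's persistence than on this toy class.

```python
import os
import sqlite3
import tempfile


class DurableQueue:
    """Tiny disk-backed FIFO queue (illustrative, not production-grade)."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)"
        )
        self.conn.commit()

    def put(self, url):
        with self.conn:  # commits the insert, so the job is on disk
            self.conn.execute("INSERT INTO jobs (url) VALUES (?)", (url,))

    def get(self):
        """Return the oldest (id, url) without removing it, or None if empty."""
        return self.conn.execute(
            "SELECT id, url FROM jobs ORDER BY id LIMIT 1"
        ).fetchone()

    def ack(self, job_id):
        """Delete a job only after it has been processed successfully."""
        with self.conn:
            self.conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))

    def close(self):
        self.conn.close()


def demo_restart():
    """Enqueue three URLs, process one, 'crash', and resume from disk."""
    path = os.path.join(tempfile.mkdtemp(), "jobs.db")

    q = DurableQueue(path)
    for url in ["a", "b", "c"]:
        q.put(url)
    job_id, _ = q.get()
    q.ack(job_id)  # "a" is done
    q.close()      # simulate the service going down

    q = DurableQueue(path)  # service comes back up
    _, next_url = q.get()
    q.close()
    return next_url
```

Because a job is deleted only on `ack`, a worker that dies mid-job leaves the job in the file, and it is picked up again after the restart. With a direct synchronous call there is no such record unless the crawler keeps one itself.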

    what came to me as a possible reason is just more complex flow control if one of them fails

    That approach isn't flexible. Normally you should be able to add as many new instances of a service as you want to scale the workload, without any code changes. So the "flow control" should not exist as code that has to be modified each time you add or remove instances of a service. In real applications that scale up and down, this is handled automatically, without redeploying the application.
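One way to picture this is a queue with competing consumers: the producer code is the same whether one worker or many are reading from the queue. The sketch below uses threads in one process purely for illustration; `worker`, `run` and the `indexed:` prefix are invented names, and in a real system the workers would be separate service instances started by an autoscaler.

```python
import queue
import threading


def worker(jobs, results):
    """Process jobs until a None sentinel is received."""
    while True:
        url = jobs.get()
        if url is None:
            return
        results.put(f"indexed:{url}")  # placeholder for real work


def run(urls, num_workers):
    """Process all URLs with the given number of workers."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()

    # Producer side: identical no matter how many workers exist.
    for url in urls:
        jobs.put(url)

    # One sentinel per worker so each of them shuts down.
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()

    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

Changing `num_workers` changes only how many consumers share the queue. The loop that enqueues URLs never needs to know how many there are, or which one handles a given job.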