I'm currently investigating whether Azure Service Fabric and its Reliable Services can be used to implement the architecture for my problem domain.
Problem domain: I am researching distributed large-scale web crawling architectures in which dozens of parallel agents crawl web servers and download resources for later indexing.
I've found a useful academic paper that describes an Azure-based distributed web-crawling architecture (Link to .pdf paper), and I'm trying to build and test a prototype based on this design.
The high-level design looks something like the figure below:
The idea: the Central Web Crawling System Engine (CWCE from here on) runs in an infinite loop until the program is aborted and fetches Service Bus queue messages, each containing the URL of a page to be crawled. The CWCE then checks the hostname of this URL and consults the Agent Registrar SQL database to see whether an alive agent already exists for that hostname. If not, the CWCE does one of the following:
If the number of alive agents (A_alive) is equal to Max (the upper bound on agents, set by the application administrator), the CWCE waits until A_alive < Max.
If A_alive < Max, the CWCE tries to create a new agent and assign the hostname to it (the agent is then registered in the SQL Registrar database).
Each agent runs on its own partition (a URL hostname, for example: example.com) and recursively crawls only pages of this hostname, while discovering URLs with external hostnames and adding them to the Service Bus queue for other agents to process.
The benefit of this architecture would be horizontal scaling of agents, with crawl throughput growing near-linearly as agents are added.
However, I am very new to Azure Service Fabric and would therefore like to ask whether this PaaS layer is capable of solving this problem. Main questions:
Would it be possible to create new web crawling agent instances programmatically and pass them a hostname parameter using Azure Service Fabric? (Maybe using the FabricClient class to manipulate the cluster and create service instances?)
Which ASF programming model best fits this scenario of parallel long-running agents? Stateless services, stateful services or the Actor model? Each agent might run as a long-running task, since it recursively crawls the URLs of a specific hostname and listens to the queue.
Would it be possible to control and change the upper bound on alive agents (Max) at runtime?
Would it be possible to have the CWCE component as an infinite-loop stateless service that continuously listens for queue messages in order to spin up new agents?
I am not sure whether ASF is the best PaaS layer for this distributed web-crawling use case, so your insights would be very valuable to me. Any helpful resource links would also be appreciated.
Service Fabric will allow you to implement the architecture that you want.
- Would it be possible to create new web crawling agent instances programmatically and pass them a hostname parameter using Azure Service Fabric? (Maybe using the FabricClient class to manipulate the cluster and create service instances?)
Yes. The service you develop and deploy to Service Fabric will be a ServiceType. Service types don't actually run; instead, from a ServiceType you create the actual Services, which are named. A single Service (e.g. ServiceA) will have a number of Instances, to allow scaling and availability. You can programmatically create and remove services of a given type and pass parameters to them (for example, through the service's initialization data), so every service will know which hostname to crawl.
Check an example here.
- Which ASF programming model best fits this scenario of parallel long-running agents? Stateless services, stateful services or the Actor model? Each agent might run as a long-running task, since it recursively crawls the URLs of a specific hostname and listens to the queue.
I would choose stateless services, because they will be the most efficient in terms of resource utilization and the easiest to manage (no need to store and manage state, partitioning and replicas). The one thing you need to consider is that every service will eventually crash and restart, so you need to keep the current crawling position in a permanent store, not in memory.
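To make the "permanent store, not memory" point concrete, here is a minimal sketch of a crawl loop that checkpoints its frontier and visited set after every page, so a restarted agent resumes where it left off. It is plain Python, not Service Fabric code: `CrawlCheckpoint`, `crawl`, `fetch_links` and `max_pages` are hypothetical names, and a local JSON file stands in for whatever durable store you would actually use (a database, blob storage, and so on).

```python
import json
import os
import tempfile
from collections import deque


class CrawlCheckpoint:
    """Persists the frontier and visited set (a JSON file stands in for a durable store)."""

    def __init__(self, path):
        self.path = path

    def load(self, seed_url):
        # Resume from the last saved state, or start fresh from the seed URL.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                state = json.load(f)
            return deque(state["frontier"]), set(state["visited"])
        return deque([seed_url]), set()

    def save(self, frontier, visited):
        # Write to a temp file and rename it, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"frontier": list(frontier), "visited": sorted(visited)}, f)
        os.replace(tmp, self.path)


def crawl(checkpoint, seed_url, fetch_links, max_pages=None):
    """Crawl breadth-first, checkpointing after each page.

    fetch_links(url) is a caller-supplied function returning the links on a page.
    max_pages only exists here to simulate an agent that stops early.
    """
    frontier, visited = checkpoint.load(seed_url)
    processed = 0
    while frontier and (max_pages is None or processed < max_pages):
        url = frontier.popleft()
        if url in visited:
            continue
        links = fetch_links(url)
        visited.add(url)
        frontier.extend(link for link in links if link not in visited)
        checkpoint.save(frontier, visited)
        processed += 1
    return visited
```

If the agent dies between fetching a page and saving the checkpoint, that page is fetched again after the restart, so this gives at-least-once processing rather than exactly-once.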
- Would it be possible to control and change the upper bound on alive agents (Max) at runtime?
Yes. Service Fabric services run on nodes (virtual machines), and in Azure these are managed by Virtual Machine Scale Sets. You can easily add and remove nodes from the VMSS, which lets you adjust the total compute and memory capacity available, and the actual number of services is already controlled by you, as described in point 1.
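Since the number of agents is decided by your own code, the limit can simply be a value that the dispatcher re-reads on every message. A minimal sketch of the decision rule from the question, assuming an in-memory dictionary in place of the Agent Registrar SQL database (`AgentRegistrar`, `dispatch`, `retire` and the `agent-<hostname>` naming are all hypothetical, and actually creating the Service Fabric service is left out):

```python
from urllib.parse import urlsplit


class AgentRegistrar:
    """In-memory stand-in for the Agent Registrar SQL database."""

    def __init__(self, max_agents):
        self.max_agents = max_agents  # upper bound; may be changed at any time
        self.alive = {}               # hostname -> agent id

    def dispatch(self, url):
        """Decide what to do with a URL taken from the queue.

        Returns (action, agent_id), where action is "existing", "created" or "wait".
        """
        hostname = urlsplit(url).hostname
        if hostname in self.alive:
            return "existing", self.alive[hostname]
        if len(self.alive) >= self.max_agents:
            # At the limit: the caller should retry this message later.
            return "wait", None
        agent_id = f"agent-{hostname}"
        self.alive[hostname] = agent_id
        return "created", agent_id

    def retire(self, hostname):
        # Called when an agent finishes or dies, freeing a slot.
        self.alive.pop(hostname, None)
```

Because `max_agents` is checked on each call, raising it at runtime takes effect on the next message:

```python
registrar = AgentRegistrar(max_agents=1)
registrar.dispatch("http://example.com/a")  # ("created", "agent-example.com")
registrar.dispatch("http://other.org/")     # ("wait", None)
registrar.max_agents = 2
registrar.dispatch("http://other.org/")     # ("created", "agent-other.org")
```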
- Would it be possible to have the CWCE component as an infinite-loop stateless service that continuously listens for queue messages in order to spin up new agents?
Absolutely. Message-driven microservices are very common. It's technically not an infinite loop in your own code, but a service with a Service Bus communication listener. I found one here as a reference, but I don't know if it's production ready.
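To illustrate the difference, here is a rough sketch of the listener pattern in plain Python: the receive loop lives inside the listener, and the service code only supplies a handler for each message. This is not the Service Fabric or Service Bus API. `QueueListener`, `open` and `close` are hypothetical names, and a standard-library `queue.Queue` stands in for the Service Bus queue.

```python
import queue
import threading


class QueueListener:
    """Pumps messages from a queue into a handler on a background thread."""

    _STOP = object()  # sentinel used to end the pump loop

    def __init__(self, source, handler):
        self._source = source
        self._handler = handler
        self._thread = threading.Thread(target=self._pump, daemon=True)

    def open(self):
        # Start receiving; the service itself does not loop.
        self._thread.start()

    def close(self):
        # Messages already queued are handled before the sentinel is reached.
        self._source.put(self._STOP)
        self._thread.join()

    def _pump(self):
        while True:
            message = self._source.get()  # blocks until a message arrives
            if message is self._STOP:
                break
            self._handler(message)


if __name__ == "__main__":
    urls = queue.Queue()
    listener = QueueListener(urls, lambda url: print("dispatching", url))
    listener.open()
    urls.put("http://example.com/")
    listener.close()
```

In the CWCE, the handler would be the place to look up the hostname and decide whether to create a new agent.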