multithreading, azure, scalability, distributed-computing, azure-web-roles

Azure web role with distributed background work


I would like to create a WCF service in a web role to serve smartphone clients.

In addition, I need to create an infinite background task that iterates over each member and sends a push notification to the member if needed (the decision is based on a Facebook query for each member).

For the moment the application is new and I don't have clients, so I don't want more than one VM (for cost savings), but in the future I may need to scale, so I want to support that. The only requirement I have is that iterating over all the members must complete in intervals of less than 30 minutes.

I was thinking about 2 solutions:

1. Run a classic Windows service on only one VM (if I scale out, this Windows service will still run on only one of the instances). In addition, I will add a void Handle(Member member) method to the WCF service. The Windows service will have an infinite loop that sends WCF requests to this method. This way, if I scale out my instances, the load balancer will distribute the work.

The problem: the Windows service doesn't know how many concurrent requests it can send to the WCF service. I really need concurrent handling of the members (up to the limit the server can handle in parallel, of course), because while I wait for one Facebook query to complete I can send more Facebook requests for other members (each Facebook query takes a few seconds to respond, because I use Facebook batch requests).
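A minimal sketch of what option 1 could look like, assuming a hypothetical IMemberService contract hosted in the web role and a bounded degree of parallelism on the Windows-service side; all names, the binding, and the limit of 16 concurrent calls are illustrative rather than taken from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Serialization;
using System.ServiceModel;
using System.Threading.Tasks;

[ServiceContract]
public interface IMemberService
{
    // called by the Windows service for each member (and usable by clients)
    [OperationContract]
    void Handle(Member member);
}

[DataContract]
public class Member
{
    [DataMember] public int Id { get; set; }
    [DataMember] public string FacebookUserId { get; set; }
}

public static class NotificationLoop
{
    // Fan out WCF calls with a bounded degree of parallelism instead of an
    // unbounded flood of requests; the limit is a guess you would tune.
    public static void HandleAll(IEnumerable<Member> members, string serviceUrl)
    {
        var factory = new ChannelFactory<IMemberService>(
            new BasicHttpBinding(), new EndpointAddress(serviceUrl));

        var options = new ParallelOptions { MaxDegreeOfParallelism = 16 }; // assumed limit

        Parallel.ForEach(members, options, member =>
        {
            IMemberService channel = factory.CreateChannel();
            try
            {
                channel.Handle(member);            // the load balancer spreads these across instances
                ((IClientChannel)channel).Close();
            }
            catch
            {
                ((IClientChannel)channel).Abort(); // never Close() a faulted channel
                throw;
            }
        });
    }
}
```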

2. Each web role instance will be responsible for handling a different subset of members (I'm talking about the background work only). For example, if I have 2 instances (because I scaled out once), the first instance will be responsible for members with id % 2 == 0 and the second for members with id % 2 == 1. To achieve this, I was thinking that on startup each instance registers itself in some SQL table.

In addition, each web role instance will have a background thread that does the following (a rough sketch follows the list):

  1. checks the SQL table to find out which members are its responsibility
  2. handles those members
  3. sets its LastHandleTime date
  4. checks whether there is a record for an instance that hasn't updated its LastHandleTime for 5 minutes, and if so removes that record (so someone else will take its members next time)
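A rough sketch of that loop, assuming hypothetical helpers (GetAssignedMemberIds, HandleMember, UpdateHeartbeat, RemoveStaleInstances) backed by the SQL table described above; the names and intervals are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class MemberSweeper
{
    public void Run()
    {
        // RoleEnvironment gives each web role instance a stable id
        string instanceId = RoleEnvironment.CurrentRoleInstance.Id;

        while (true)
        {
            // 1. which member ids are this instance's responsibility,
            //    e.g. ids where id % instanceCount == instanceIndex
            foreach (int memberId in GetAssignedMemberIds(instanceId))
            {
                // 2. Facebook batch query + push notification if needed
                HandleMember(memberId);
            }

            // 3. record that this instance is alive
            UpdateHeartbeat(instanceId, DateTime.UtcNow);

            // 4. evict instances whose LastHandleTime is older than 5 minutes,
            //    so their members get reassigned on the next pass
            RemoveStaleInstances(TimeSpan.FromMinutes(5));

            Thread.Sleep(TimeSpan.FromMinutes(1)); // assumed pause between passes
        }
    }

    // hypothetical data-access helpers backed by the SQL table from the question
    private IEnumerable<int> GetAssignedMemberIds(string instanceId) { return new List<int>(); }
    private void HandleMember(int memberId) { }
    private void UpdateHeartbeat(string instanceId, DateTime nowUtc) { }
    private void RemoveStaleInstances(TimeSpan staleAfter) { }
}
```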

Edit: There is a problem with this solution as well. I cannot keep a background thread running for the whole life of my WCF service; IIS may kill it if there is no activity in the service (for more details: http://forums.asp.net/t/1830688.aspx/1).

Edit 2: To solve this, I would run the background thread as a separate Windows service. In the future, if needed, I can easily convert it into a worker role. Since the web role and the Windows service use the same handling logic, except that the first returns a response and the second sends a push notification (if needed), I will use a shared DLL with the handling logic for both of them.

Which solution is better? How do I solve the problems I presented?


Solution

  • I wouldn't use Windows Services in the cloud, and would avoid using your #2 design as well because it doesn't scale very well. This is the way I would handle your scenario:

    I would build a Distribution Service (DS) and a Processing Service (SP). The role of the DS is to publish the list of users/members that need to receive notifications into an Azure Queue. The DS can read from the database as you suggest. Using Azure Queues makes it easier to spread the workload across worker roles and gives you built-in redundancy: items in a queue reappear if they are not processed within a specific timeframe.

    The DS could be a background thread in your WCF service, although I would probably create a dedicated worker role for that purpose. If you need more than one DS for redundancy, use a shared locking mechanism (such as leasing a common blob or using an Azure Table) so that only one DS publishes the list of members to notify into the queue. Each member that needs to receive a notification is represented by a message in the Azure Queue. For example, you could store a user id in the queue: 2019932. So if you have 10,000 notifications to send, you will have 10,000 messages in the queue.
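    A minimal sketch of the DS under those assumptions, using the classic Microsoft.WindowsAzure.Storage client; GetMemberIdsToNotify stands in for the database query, and the queue/blob names and the 60-second lease duration are illustrative:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.Queue;

public static class DistributionService
{
    public static void PublishNotifications(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);

        // Simple mutual exclusion between multiple DS instances: only the one
        // holding the blob lease publishes. Assumes the "locks/ds-lock" blob exists.
        var lockBlob = account.CreateCloudBlobClient()
                              .GetContainerReference("locks")
                              .GetBlockBlobReference("ds-lock");
        string leaseId;
        try
        {
            leaseId = lockBlob.AcquireLease(TimeSpan.FromSeconds(60), null);
        }
        catch (StorageException)
        {
            return; // another DS instance is already publishing
        }

        try
        {
            var queue = account.CreateCloudQueueClient().GetQueueReference("notifications");
            queue.CreateIfNotExists();

            foreach (int memberId in GetMemberIdsToNotify())
            {
                // one message per member, e.g. "2019932"
                queue.AddMessage(new CloudQueueMessage(memberId.ToString()));
            }
        }
        finally
        {
            lockBlob.ReleaseLease(AccessCondition.GenerateLeaseCondition(leaseId));
        }
    }

    // hypothetical helper standing in for the database query
    private static IEnumerable<int> GetMemberIdsToNotify() { return new int[0]; }
}
```

    Note that if publishing 10,000 messages takes longer than the lease duration, you would have to renew the lease periodically (or use an infinite lease with an explicit release).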

    The SP is another worker role that reads from the queue and processes the items. The SP can take N items from the queue at a time (let's say 10, for example) and process them. The SP could also run multiple threads (let's say P threads), so each SP could process roughly N * P requests at a time, in parallel. Let's further assume your code takes S seconds to process each message (check Facebook, send the notification, update the database). To scale, all you need to do is deploy more SPs: deploying X SPs gives you a throughput of X * N * P / (N * S) = X * P / S messages per second on average.
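    For illustration, a rough sketch of one SP's loop under the same assumptions (the same "notifications" queue as above, a hypothetical ProcessMember helper, N = 10 messages per dequeue, P = 10 threads):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public static class ProcessingService
{
    public static void Run(string connectionString)
    {
        var queue = CloudStorageAccount.Parse(connectionString)
                                       .CreateCloudQueueClient()
                                       .GetQueueReference("notifications");

        while (true)
        {
            // take up to N = 10 messages and hide them for 5 minutes; if this SP
            // crashes before deleting them, they reappear and another SP gets them
            var batch = queue.GetMessages(10, TimeSpan.FromMinutes(5)).ToList();
            if (batch.Count == 0)
            {
                Thread.Sleep(TimeSpan.FromSeconds(5));
                continue;
            }

            // P = 10 parallel threads; each message costs roughly S seconds of work
            Parallel.ForEach(batch, new ParallelOptions { MaxDegreeOfParallelism = 10 }, msg =>
            {
                ProcessMember(int.Parse(msg.AsString)); // Facebook check + push + DB update
                queue.DeleteMessage(msg);               // delete only after it succeeded
            });
        }
    }

    // hypothetical helper: the actual notification logic
    private static void ProcessMember(int memberId) { }
}
```

    To make the formula concrete with assumed numbers: if S ≈ 3 seconds and P = 10 threads, one SP averages about P / S ≈ 3.3 messages per second, so 10,000 members take roughly 50 minutes; two SPs (X = 2) bring that down to about 25 minutes, within the 30-minute requirement from the question.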

    You can technically have the DS and the SP in the same worker role; as long as you use a shared locking mechanism, you don't have to worry about duplicate messages when you deploy the role to multiple machines.

    As usual, when you deal with a large number of requests you enter the world of sharding. See this link for the current scalability targets of Azure. For example, a single queue can process up to 2,000 messages per second. It's a target, not a guarantee. :) If you think you will need more than that, you can always use multiple queues.
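    If a single queue ever becomes the bottleneck, one simple way to shard (just a sketch; the queue count of 4 and the naming scheme are arbitrary) is to hash the member id over several queues on the DS side and have each SP drain one or more of them:

```csharp
// assumes the same CloudStorageAccount 'account' and 'memberId' loop variable
// as in the earlier DS sketch
const int QueueCount = 4; // arbitrary shard count

var client = account.CreateCloudQueueClient();
var shard = client.GetQueueReference("notifications-" + (memberId % QueueCount));
shard.CreateIfNotExists();
shard.AddMessage(new CloudQueueMessage(memberId.ToString()));
```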