Tags: architecture, synchronization, queue, web-crawler, servicebus

Synchronize access of queue workers


I'm currently writing a service which crawls DotA 2 matches using the Steam Web API. Because I want my solution to be scalable, I want to allow the crawling jobs to be buffered and processed concurrently. That's why I thought of a queue:

[Diagram: crawling architecture]

All of the components should be able to run on different computers/VMs (no in-memory or inter-process synchronization). Crawling jobs could be something like this:

Job 1: Crawl match 1234 with options ABC
Job 2: Crawl match 2345 with options BCD

Because of the nature of the data, multiple jobs pointing to the same match might be enqueued (e.g. two players played in the same game). Therefore, I need a synchronization mechanism that a plain queue can't provide: crawlers must not attempt to write data for the same match at the same time.

My actual question is: is there a pattern which can be used to synchronize queue workers which need to access the same data?

One approach I thought of was introducing another service which allows the crawlers to lock matches, which they would have to do before reading match data from or writing it to the database:

[Diagram: crawling controller]

But that would introduce a whole bunch of new questions and requirements:

  • How to scale the controller?
  • What if the controller crashes?
  • What if a queue worker does not unlock a match?
  • ...
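
One way to blunt the last two concerns is to make every lock a lease that expires on its own, mediated by the database rather than by a dedicated lock service. Below is a minimal sketch in T-SQL (matching the SQL Server choice listed further down); the MatchLocks table, the column names and the five-minute lease are illustrative assumptions, not an established design:

    -- Hypothetical MatchLocks table: one row per match that is currently locked.
    CREATE TABLE MatchLocks (
        MatchId    BIGINT           NOT NULL PRIMARY KEY,
        OwnerId    UNIQUEIDENTIFIER NOT NULL,  -- identifies the crawler instance
        LeaseUntil DATETIME2        NOT NULL   -- the lock expires by itself
    );

    -- A crawler tries to acquire (or steal an expired) lease on one match.
    DECLARE @MatchId BIGINT = 1234;
    DECLARE @OwnerId UNIQUEIDENTIFIER = NEWID();
    DECLARE @now     DATETIME2 = SYSUTCDATETIME();

    -- Take over the row only if its lease has already run out.
    UPDATE MatchLocks
    SET OwnerId = @OwnerId, LeaseUntil = DATEADD(MINUTE, 5, @now)
    WHERE MatchId = @MatchId AND LeaseUntil < @now;

    IF @@ROWCOUNT = 0
        -- No expired lease to steal: try to create a fresh one. If two crawlers
        -- race here, the primary key rejects one INSERT, and that crawler
        -- simply treats the error as "lock not acquired".
        INSERT INTO MatchLocks (MatchId, OwnerId, LeaseUntil)
        SELECT @MatchId, @OwnerId, DATEADD(MINUTE, 5, @now)
        WHERE NOT EXISTS (SELECT 1 FROM MatchLocks WHERE MatchId = @MatchId);

Under this scheme a crashed worker never needs to unlock anything: its lease simply runs out and the match becomes claimable again, which also covers the case of a queue worker that never unlocks a match.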

If it is of interest, here are the technologies I'll probably use:

  • Queue: Service Bus for Windows Server
  • Services: .NET Web API
  • DB: SQL Server 2012

Solution

  • This sounds like a reservation system, the sort of problem that online ticket booking systems have:

    user asks for tickets
    system offers specific tickets
    user thinks a while and maybe pays; during that think time the system cannot offer those tickets to anyone else
    eventually user buys, rejects or maybe just times out
    system updates ticket availability
    

    Question: in your system, is it a problem if two crawlers with the same parameters are crawling at the same time, provided that they cannot update the results at the same time? The reason I ask is that I perceive the crawling action itself as analogous to user think time: a long-running action for whose duration it's not reasonable to hold a database lock.

    The scheme I'd propose is optimistic locking, mediated by the database and database transactions, hence no need for a separate controller. Your DB then becomes a single point of failure and ultimately a scalability bottleneck, but you can address that by some partitioning of the DB.
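
    As a minimal sketch of that idea, assuming a Matches table with an integer Revision column (both names invented for illustration): the crawler notes the revision before the long crawl, holds no lock while crawling, and the final write succeeds only if the revision is unchanged.

        -- 1. Before the crawl: remember which revision of the row we started from.
        DECLARE @MatchId BIGINT = 1234;
        DECLARE @CrawledData NVARCHAR(MAX) = N'{ "placeholder": true }';
        DECLARE @seenRevision INT;
        SELECT @seenRevision = Revision FROM Matches WHERE MatchId = @MatchId;

        -- 2. ...the long-running crawl happens here, with no lock held...

        -- 3. After the crawl: the write only succeeds if nobody touched the row.
        UPDATE Matches
        SET CrawlResult = @CrawledData,
            Revision    = Revision + 1
        WHERE MatchId = @MatchId
          AND Revision = @seenRevision;

        IF @@ROWCOUNT = 0
            PRINT 'Conflict: another crawler already wrote this match; discard or retry.';

    The losing crawler wastes some work but never corrupts data, which is the usual optimistic-locking trade-off.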

    You need some kind of controller, but it need not be a singleton; again, mediate the instances via database locks. The big problem I see is reliably catching failed crawlers. It's easy enough to maintain a DB table of running crawlers in the "blue-sky" scenarios; it's the failure cases that seem very tricky to me.
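
    One common way to make such a table useful in the failure cases is a heartbeat plus a reaper. The sketch below assumes the hypothetical RunningCrawlers table and the MatchLocks table from the earlier lease example, plus an arbitrary five-minute timeout:

        -- Each live crawler refreshes its heartbeat row at a fixed interval.
        DECLARE @CrawlerId UNIQUEIDENTIFIER = NEWID();  -- really: the worker's stable id
        UPDATE RunningCrawlers
        SET LastHeartbeat = SYSUTCDATETIME()
        WHERE CrawlerId = @CrawlerId;

        -- A periodic "reaper" job declares silent crawlers dead and frees their matches.
        DELETE FROM MatchLocks
        WHERE OwnerId IN (
            SELECT CrawlerId
            FROM RunningCrawlers
            WHERE LastHeartbeat < DATEADD(MINUTE, -5, SYSUTCDATETIME())
        );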

    I wonder whether the trick is to partition the database, each partition corresponding to a "workgroup" with its own controller. So long as the controller is alive, it can initiate work and police the queries so that duplicates don't occur within its workgroup. On completion of any crawl, a "ready" message is queued, and a result-consolidation service pulls data from the partition into the master.
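
    To make the workgroup idea concrete, here is one possible (entirely hypothetical) shape for the routing and deduplication, using a made-up WorkgroupJobs table and a simple modulo hash. Every job for a given match lands in the same workgroup, so only that workgroup's controller has to deduplicate:

        -- Route every match to one of N workgroup partitions by hashing its id.
        DECLARE @PartitionCount INT = 4;
        DECLARE @MatchId BIGINT = 2345;
        DECLARE @Options NVARCHAR(100) = N'BCD';
        DECLARE @PartitionId INT = @MatchId % @PartitionCount;

        -- The workgroup's controller enqueues a job only if no equivalent job is active.
        INSERT INTO WorkgroupJobs (PartitionId, MatchId, Options, Status)
        SELECT @PartitionId, @MatchId, @Options, 'queued'
        WHERE NOT EXISTS (
            SELECT 1 FROM WorkgroupJobs
            WHERE MatchId = @MatchId AND Status IN ('queued', 'running')
        );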