Tags: azure, architecture, transactions, azure-worker-roles, idempotent

Multiple Instances of Azure Worker Roles for non-transactional integration tasks


We have an upcoming project where we'll need to integrate with 3rd parties over a variety of transports to get data from them.

Things like WCF endpoints and Web API REST endpoints are fine.

However, in two scenarios we'll need to either pick up auto-generated emails containing XML from a POP3 account or pull the XML files from an external SFTP account.

I'm about to start prototyping these now, but I'm wondering whether there are any standard practices, patterns, or guidelines for dealing with these non-transactional systems in a multi-instance worker role environment. For example:

What happens if two workers connect to the POP3 account, or to the same SFTP server, at the same time?

What happens if one worker deletes a file from the SFTP server while another is mid-download?

Controlling duplication shouldn't be an issue: we'll be logging everything on the application side to a database, and everything should be uniquely identifiable, so we'll be able to add if-not-exists-create-else-skip logic to the workers. I'm just wondering whether there is anything else I should consider to make it more resilient/idempotent.
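
As a rough illustration of the if-not-exists-create-else-skip check described above, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for the application database; the table name `processed_items`, its columns, and the helper names are assumptions made up for the example. The point is only that a unique key lets the database, rather than the worker, decide which instance gets to process an item.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open the (hypothetical) processing log; item_id is the unique key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_items ("
        " item_id TEXT PRIMARY KEY,"
        " source TEXT NOT NULL)"
    )
    return conn


def try_claim(conn, item_id, source):
    """Return True if this call recorded the item, False if it was already logged."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_items (item_id, source) VALUES (?, ?)",
            (item_id, source),
        )
    # rowcount is 1 when the row was inserted and 0 when the insert was ignored.
    return cur.rowcount == 1


conn = open_log()
if try_claim(conn, "order-123.xml", "sftp"):
    pass  # first time this item is seen: process it
else:
    pass  # already logged by this or another worker: skip it
```

Because the insert and the uniqueness check happen in one statement, two workers racing on the same identifier cannot both receive `True`.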


Solution

  • Just thinking out loud: since the data is primarily files and emails, one thing you could do is save them in blob storage first instead of processing them directly in your worker roles. Some worker role instances would periodically poll the POP3 server / SFTP site, pull the data from there, and push it into blob storage. Once the blob is written, the same instance can delete the data from the source. With this approach you don't have to worry about duplicate records, because the blob will simply be overwritten (assuming each message/file has a unique identifier and the blob's name is that identifier).
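
    The staging step can be sketched roughly as follows. This is not Azure SDK code: `InMemorySource` and `InMemoryBlobStore` are hypothetical in-memory stand-ins for the POP3/SFTP source and for blob storage, used only to show the ordering (write the blob first, delete from the source afterwards) and why naming the blob after the item's identifier makes a repeated upload harmless.

```python
class InMemorySource:
    """Hypothetical stand-in for the POP3 mailbox or SFTP directory."""

    def __init__(self, items):
        self._items = dict(items)

    def list_ids(self):
        return sorted(self._items)

    def fetch(self, item_id):
        # Returns None if another worker has already removed the item.
        return self._items.get(item_id)

    def delete(self, item_id):
        # Deleting an item that is already gone is treated as a no-op.
        self._items.pop(item_id, None)


class InMemoryBlobStore:
    """Hypothetical stand-in for a blob container."""

    def __init__(self):
        self.blobs = {}

    def put(self, name, data):
        # Writing the same name again overwrites, so a re-upload adds no duplicate.
        self.blobs[name] = data


def stage_all(source, store):
    """Copy every available item into the store, then remove it from the source."""
    staged = []
    for item_id in source.list_ids():
        data = source.fetch(item_id)
        if data is None:
            continue  # taken by another instance between listing and fetching
        store.put(item_id, data)  # write the blob first...
        source.delete(item_id)    # ...and only then delete from the source
        staged.append(item_id)
    return staged
```

    If a worker crashes between the write and the delete, the item is still on the source and is simply staged again on the next poll, overwriting the same blob.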

    Once the file is in blob storage, you can write a message to a Windows Azure Queue with details about the blob (maybe the blob URL, etc.). Then, using the 'Get' semantics of Windows Azure Queues, your worker role instances start fetching and processing these messages. Because of the Get semantics, once a message is fetched from the queue it becomes invisible to other callers (the other worker role instances in this case) for the duration of its visibility timeout. This way you can take care of duplicate message processing.
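
    To make the 'Get' behaviour concrete, here is a small in-memory model of it. `VisibilityQueue` is a made-up class, not the Azure Queue API, and time is passed in explicitly as `now` so the behaviour is easy to follow. It illustrates that a fetched message is hidden from other callers only until its visibility timeout expires, so the worker should delete the message once processing succeeds, and processing should still tolerate an occasional redelivery.

```python
import itertools


class VisibilityQueue:
    """Toy model of a queue whose messages are hidden for a while after a get."""

    def __init__(self):
        self._messages = {}  # message id -> [body, time at which it is visible again]
        self._ids = itertools.count(1)

    def put(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = [body, 0.0]
        return msg_id

    def get(self, now, visibility_timeout=30.0):
        """Return (id, body) of the oldest visible message and hide it, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        """Remove a message for good once it has been processed."""
        self._messages.pop(msg_id, None)


queue = VisibilityQueue()
queue.put("incoming/a.xml")             # hypothetical blob name
msg = queue.get(now=0.0)                # worker 1 receives the message
assert queue.get(now=1.0) is None       # worker 2 sees nothing while it is hidden
assert queue.get(now=31.0) == msg       # not deleted in time, so it is delivered again
queue.delete(msg[0])                    # processing finished: remove it permanently
```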

    UPDATE

    "So I'm trying to combat against two competing instances pulling the same file at the same moment from the SFTP"

    For this, I'll pitch my favorite Master/Slave concept :). Essentially, the idea is that each instance tries to acquire a lease on a single blob. The instance that acquires the lease becomes the master and the others become slaves. The master fetches the data from SFTP while the slaves wait. I've described this concept in my blog post, which you can read here: http://gauravmantri.com/2013/01/23/building-a-simple-task-scheduler-in-windows-azure/, though the context of the blog post is somewhat different.
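
    A rough sketch of the lease idea, again using an in-memory stand-in rather than real blob storage: `LeaseBlob` and `run_poll_cycle` are hypothetical names, and the 60-second duration is just an example value. Each instance tries to take the lease at the start of a polling cycle; only the one that gets it talks to the SFTP server, and the lease expires on its own if the holder dies without releasing it.

```python
import threading
import uuid


class LeaseBlob:
    """Toy model of a single blob that can be leased by one holder at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._lease_id = None
        self._expires_at = 0.0

    def try_acquire(self, now, duration=60.0):
        """Return a new lease id, or None if an unexpired lease is already held."""
        with self._lock:
            if self._lease_id is not None and now < self._expires_at:
                return None
            self._lease_id = uuid.uuid4().hex
            self._expires_at = now + duration
            return self._lease_id

    def release(self, lease_id):
        """Release the lease, but only if lease_id is the current holder's id."""
        with self._lock:
            if lease_id == self._lease_id:
                self._lease_id = None


def run_poll_cycle(lease, now, poll):
    """Run poll() only if this instance wins the lease; return whether it ran."""
    lease_id = lease.try_acquire(now)
    if lease_id is None:
        return False  # another instance holds the lease: wait for the next cycle
    try:
        poll()  # e.g. list and download files from the SFTP server
    finally:
        lease.release(lease_id)
    return True
```

    With this shape, no instance is permanently the master; whichever instance wins the lease in a given cycle does the polling, which avoids a single point of failure.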