azure web-crawler azure-worker-roles

Creating a Web Crawler using Windows Azure


I want to create a web crawler that takes the content of some website and saves it in blob storage. What is the right way to do that on Azure? Should I start a Worker role and use the Thread.Sleep method to make it run once a day?

I also wonder how this would work if I create two instances of the Worker role. Using the "Compute Emulator UI", I noticed that the "Trace.WriteLine" command runs on both instances at the same time. Can someone clarify this point?

I created the same crawler in PHP and set a cron job to start the script once a day, but it took 6 hours to grab the whole content. That's why I want to use Azure.


Solution

  • This is the right way to do it. As of January 2014, Microsoft introduced Azure WebJobs, where you can create a project (a console application, for example) and run it as a scheduled task (one-time or recurring).

    https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/
    http://www.hanselman.com/blog/IntroducingWindowsAzureWebJobs.aspx
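
As a rough illustration of the kind of console program a scheduled job could run once a day, here is a minimal Python sketch. It is not specific to WebJobs and is not the asker's code. It fetches pages concurrently, which is one way to shorten a crawl that takes hours when run sequentially. The names `fetch`, `crawl`, `fetch_page`, and `save_page` are hypothetical; `save_page` stands in for whatever code writes a page to blob storage.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, fetch_page=fetch, save_page=None, workers=8):
    """Fetch every URL concurrently and hand each body to save_page.

    Returns a dict mapping each URL to "ok" or to an error message.
    save_page is a placeholder for the storage step (assumed, not a
    real Azure API); it is called as save_page(url, body).
    """

    def process(url):
        try:
            body = fetch_page(url)
            if save_page is not None:
                save_page(url, body)
            return url, "ok"
        except Exception as exc:  # one failing page should not stop the crawl
            return url, f"error: {exc}"

    results = {}
    # Threads suit this workload because the time is spent waiting on the network.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, status in pool.map(process, urls):
            results[url] = status
    return results
```

A scheduled run would call `crawl(seed_urls, save_page=...)` once and exit, leaving the once-a-day timing to the scheduler rather than to a sleep loop inside the program.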