Automatic failover if webserver is down (SRV / additional A-record / ?)

I am starting to develop a webservice that will be hosted in the cloud but needs higher availability than typical cloud SLAs provide.

Typical SLAs, e.g. Windows Azure, promise an availability of 99.9%, i.e. up to 43 min of downtime per month. I am looking for an order of magnitude better availability (<5 min of downtime per month). While I can configure several load-balanced database back-ends to resolve that part of the issue, I see a single point of failure at the webserver: if the webserver fails, the whole service is unavailable to the customer. What are the options for reducing that risk without introducing another possible single point of failure? I see the following solutions and drawbacks to each:

SRV-record: I duplicate the whole infrastructure (and take care that the databases are in sync) and add additional SRV records for the domain so that a user trying to access www.example.com will automatically get forwarded to example.cloud1.com or, if that one is offline, to example.cloud2.com. Googling around, it seems that SRV records are not supported by any major browser. Is that true?
Second A-record: Add an additional A-record as an alternative. Drawbacks: a) at my hosting provider I do not see any way to add a second A-record, only one... is that normal? b) if one of the two servers is down, I am not sure whether the user is automatically redirected to the other one or whether 50% of all users get a 404 or some other error.

Any pointers to a best practice would be appreciated.

Cheers, Sebastian

Solution

When a cloud provider specifies an availability SLA for an instance, it refers to the instance's health as seen by the hypervisor or fabric controller, i.e. whether the server is running. It is therefore up to you to make sure the instance does not fail because of your app, the OS, or anything else running inside it. There are a few things that devops teams tend to miss and that hit back hard later, for instance forgetting to configure OS updates and patches.

The fundamental axiom of availability is redundancy: the more redundant your application and infrastructure are, the more available your app is.
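To put rough numbers on this, here is a minimal sketch of the arithmetic. It assumes a 30-day month, that the redundant deployments fail independently of each other, and that failover is instantaneous. None of these holds exactly in practice (failures can be correlated, and detection plus DNS caching add delay), so treat the result as an optimistic bound. The function names are made up for illustration.

```python
# Back-of-the-envelope availability arithmetic.
# Assumptions (not from any provider's SLA): 30-day month,
# independent failures, instantaneous failover.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_minutes(availability: float) -> float:
    """Expected downtime per month for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_MONTH


def combined_availability(availability: float, replicas: int) -> float:
    """Availability of N independent replicas where one survivor is enough.

    The service is down only if every replica is down at the same time.
    """
    return 1.0 - (1.0 - availability) ** replicas


if __name__ == "__main__":
    single = 0.999  # the 99.9% SLA mentioned in the question
    for replicas in (1, 2, 3):
        total = combined_availability(single, replicas)
        print(f"{replicas} replica(s): availability {total:.9f}, "
              f"downtime {downtime_minutes(total):.4f} min/month")
```

With a single 99.9% deployment this gives the roughly 43 minutes per month from the question. Under the independence assumption, two such deployments bring the theoretical figure well below the 5 minute target, which is why the realistic limit tends to be how quickly a failure is detected and traffic is redirected.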

I recommend you look into Azure Traffic Manager and then rework your architecture. You need not worry about SRV records or additional A-records; a CNAME pointing to Traffic Manager does the trick.

The idea behind Traffic Manager is simple: you place it at the domain name resolution step for your app, and it then decides where to send each request based on the policy you configure, such as round-robin or failover for disaster recovery.
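The selection logic described above can be illustrated with a small toy model. This is not the Azure Traffic Manager API and does not perform real health probes; the `Endpoint` class, `pick_failover` and `RoundRobin` are hypothetical names, and the host names are the placeholders from the question. It only shows how a failover policy and a round-robin policy might choose among endpoints whose health is already known.

```python
# Toy model of two routing policies. Health is set by hand here;
# a real traffic manager would determine it by probing each endpoint.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    priority: int          # lower number = preferred (failover policy)
    healthy: bool = True


def pick_failover(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


class RoundRobin:
    """Cycle through endpoints, skipping unhealthy ones."""

    def __init__(self, endpoints: List[Endpoint]) -> None:
        self.endpoints = list(endpoints)
        self._next = 0

    def pick(self) -> Optional[Endpoint]:
        n = len(self.endpoints)
        for offset in range(n):
            candidate = self.endpoints[(self._next + offset) % n]
            if candidate.healthy:
                # Continue after the chosen endpoint on the next call.
                self._next = (self._next + offset + 1) % n
                return candidate
        return None  # every endpoint is down


if __name__ == "__main__":
    primary = Endpoint("example.cloud1.com", priority=1)
    secondary = Endpoint("example.cloud2.com", priority=2)
    pool = [primary, secondary]

    print(pick_failover(pool).name)   # primary while it is healthy
    primary.healthy = False           # simulate an outage
    print(pick_failover(pool).name)   # secondary takes over
```

In the failover policy the secondary region only receives traffic while the primary is marked unhealthy, whereas round-robin spreads requests across all healthy regions, so losing one region removes capacity rather than switching sites.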

With the combination of Traffic Manager and a multi-region infrastructure setup, you will move towards your high availability goal.

Links

Azure Traffic Manager Overview

Cloud Power: How to scale Azure Websites globally with Traffic Manager