
How to fail over Azure ACS if a data center goes down


We are looking for a way to provide failover for ACS instances, so that if one data center goes offline, authentication via ACS automatically fails over to another data center.

Background:

We use ACS to transform SAML tokens that are provided by a custom-developed STS via the WS-Trust protocol. ACS is used to broker trust between our STS and a number of relying parties that are developed by 3rd parties. The relying parties are currently configured to connect to a specific ACS instance using its DNS URL.

We have looked into the following:

  1. Using a DNS CNAME entry to mask the ACS URL. This doesn’t work because the new DNS name will not match the SSL certificate on the instance, and we can’t control that certificate.
  2. Using a proxy in front of ACS to route requests to it. This doesn’t work because the To address and Realm in the messages don’t match the ACS namespace.
  3. Using Traffic Manager. This doesn’t work because of both 1 and 2, and because it currently won’t let you direct load to an address that doesn’t end in .cloudapp.net.
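
The certificate mismatch in item 1 can be illustrated with a minimal Python sketch. It is not how any particular TLS stack is implemented; it is a simplified left-most-label wildcard match in the spirit of RFC 6125. The certificate name `*.accesscontrol.windows.net` is an assumption about what the ACS endpoint presents, and `mynamespace` and `login.contoso.example` are hypothetical names.

```python
def hostname_matches(cert_name: str, hostname: str) -> bool:
    """Simplified check: does a certificate name cover the requested hostname?

    Only a wildcard in the left-most label is honoured, and it matches
    exactly one label. Real TLS libraries apply more rules than this.
    """
    cert_labels = cert_name.lower().split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    if len(cert_labels) != len(host_labels):
        return False
    # Every label after the first must match exactly.
    if cert_labels[1:] != host_labels[1:]:
        return False
    return cert_labels[0] == "*" or cert_labels[0] == host_labels[0]


# Assumed name on the certificate served by the ACS endpoint.
ASSUMED_CERT_NAME = "*.accesscontrol.windows.net"

# The client validates the name it connected to, not the CNAME target.
print(hostname_matches(ASSUMED_CERT_NAME, "mynamespace.accesscontrol.windows.net"))  # True
print(hostname_matches(ASSUMED_CERT_NAME, "login.contoso.example"))                  # False
```

Because the client checks the host name it was asked to connect to, an alias that resolves to the ACS endpoint still fails validation unless the certificate also lists the alias.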

Solution

  • I don't think there is a realistic and foolproof solution here. As noted, you can create additional namespaces in other data centers and take backups of your RP configurations and transformation rules. To recover, you would restore a backup to the new namespace, and your clients would then need to reconfigure their apps to use that namespace. This can work in some scenarios (such as Google and Yahoo! integration). It can even work (I think) for Active Directory integration. However, it is very problematic if you don't control the RPs.

    A separate problem with this approach, and a blocking one (for us at least), is that it won't work for Windows Live name identifier claims. We get a different name identifier per namespace for each of our users. So, even if we restored all our settings in another data center (and we control the RPs too!), our Windows Live users would be unable to log in correctly because their name identifiers from the new namespace would no longer match the ones we have stored. Google and Yahoo! would not have this problem, as they can use a stable claim (like email).

    Basically, it appears you are mostly at the mercy of the data center operations team to fail over to the subregion quickly in case of total data center loss.
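
The name identifier problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the user store, the claim keys `nameidentifier` and `emailaddress`, and every identifier value are hypothetical, and a real RP would read these claims from the token issued by ACS.

```python
from typing import Optional

# Hypothetical RP user store, keyed two ways. All values are made up.
USERS_BY_NAME_ID = {"liveid-as-seen-by-namespace-a": "alice"}
USERS_BY_EMAIL = {"bob@example.com": "bob"}


def resolve_user(claims: dict) -> Optional[str]:
    """Return the local account for a set of incoming claims, or None."""
    account = USERS_BY_NAME_ID.get(claims.get("nameidentifier"))
    if account is not None:
        return account
    # Fall back to a stable claim when the token carries one.
    return USERS_BY_EMAIL.get(claims.get("emailaddress"))


# The same Windows Live user before and after moving to a restored namespace:
# the name identifier changes and there is no other claim to fall back on.
live_before = {"nameidentifier": "liveid-as-seen-by-namespace-a"}
live_after = {"nameidentifier": "liveid-as-seen-by-namespace-b"}

# A Google user whose token carries the same email claim in either namespace.
google_after = {
    "nameidentifier": "google-id-as-seen-by-namespace-b",
    "emailaddress": "bob@example.com",
}

print(resolve_user(live_before))   # alice
print(resolve_user(live_after))    # None
print(resolve_user(google_after))  # bob
```

The Windows Live lookup fails after the move because nothing in the token ties the new identifier to the stored one, which is why restoring the configuration alone does not restore those logins.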