java saas high-availability cloudbees paas

CloudBees Service Level Agreement(s) and Capabilities Service

I have been comparing Java PaaSes carefully and am really starting to like CloudBees. I only have one big concern with them, and that is their SLA/uptime.

After scouring through all of their documentation, I can only find one paper they offer on SLAs which states:

If you are using the CloudBees PaaS without taking advantage of high availability options, then CloudBees can only offer uptime that approaches the base uptime SLA of the infrastructure cloud provider.

As the same paper also mentions, Amazon appears to offer a 99.95% uptime SLA, and I know that CloudBees itself runs largely on AWS/EC2 instances.

So this spawns a number of closely-related SLA questions:

- If I don't take advantage of "high availability" options, can I assume that CloudBees doesn't even guarantee 99.95%? Or is there documentation elsewhere that states what their uptime is, and what the remedies are if they fail to meet it?
- What high availability options are they talking about here? I just read their entire developer docs and never saw anything about HA.
- What are my remedies if a partner service (like SendGrid for mail, or MemCachier for caching) goes down? One thing I do like about GAE is its CapabilitiesService: before you use their Email API or Caching API, you first check with the master CapabilitiesService to make sure those services are operating. I'd like to do the same with CloudBees, but it seems like I'd need to build it myself. That's fine, but I'm not sure whether CloudBees even offers a mechanism (API call, etc.) to determine if a particular service partner is online or offline.

Thanks in advance!

Solution

CloudBees does not offer an availability SLA, nor remedies in the form of credits if a particular level of uptime is not met in a month. AFAIK this is common for other offerings on AWS (e.g., Heroku). CloudBees does offer standard response-time based SLAs via a support agreement. As discussed in the white paper you reference, we also employ practices for our own usage of AWS and external providers that have helped to isolate our users from some specific Amazon issues.
The availability features you can make use of include:
- Using multiple instances (and potentially auto-scaling). App instances are spread by CloudBees across different EC2 instances, so you can avoid downtime in the event of an EC2 instance failure.
- Using the session store. You can share session state in a tier separate from your app instances, using our offering or a partner offering like MemCachier.
- Using dedicated servers that CloudBees sets up in multiple AWS availability zones.
- Ensuring the database used with your app is set up in a highly available configuration. For example, RDS is simple to use with CloudBees and supports standbys and read replicas in multiple AZs.
- Using app monitoring solutions from partners like New Relic and AppDynamics to alert you of any issues.
The main point of the comment about using "high availability options" was to warn people that simply deploying an app on CloudBees does not make it highly available. If an EC2 instance fails underneath your single-instance deployment, your users will experience downtime while our internal machinery redeploys to a working instance, whereas a multi-instance deployment will likely only experience slower responses until a new instance is deployed. The same applies to single-instance databases without standbys or replicas across AZs. While this is just stating the blindingly obvious for a lot of people, you might be surprised how many people just assume some magic is happening.
Good point on the CapabilitiesService! We have some ideas kicking around in this area, but you would have to do something like this on your own for now.
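If you do roll your own, a minimal sketch might look like the following. To be clear, this is not a CloudBees or GAE API: the class name CapabilityRegistry, its methods, and the idea of caching each probe result for a fixed interval are all assumptions made for illustration. Each partner service gets a probe that you write yourself (for example, a cheap call to the partner's API with a short timeout), and the result is cached so that checking before every call does not add a round trip each time.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical home-grown capability check; not part of any CloudBees or GAE API. */
final class CapabilityRegistry {

    /** One registered service: its probe plus the cached result of the last run. */
    private static final class Entry {
        final BooleanSupplier probe;
        boolean everChecked;
        boolean lastResult;
        long checkedAtMillis;

        Entry(BooleanSupplier probe) {
            this.probe = probe;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;

    CapabilityRegistry(Duration ttl) {
        this(ttl, System::currentTimeMillis);
    }

    /** The clock is injectable so the caching behaviour can be tested without sleeping. */
    CapabilityRegistry(Duration ttl, LongSupplier clock) {
        this.ttlMillis = ttl.toMillis();
        this.clock = clock;
    }

    /** Registers (or replaces) the probe for a service name such as "mail" or "cache". */
    void register(String service, BooleanSupplier probe) {
        entries.put(service, new Entry(probe));
    }

    /**
     * Returns the cached status if it is younger than the TTL; otherwise re-runs the probe.
     * Unknown services and probes that throw are reported as unavailable.
     */
    boolean isAvailable(String service) {
        Entry entry = entries.get(service);
        if (entry == null) {
            return false;
        }
        long now = clock.getAsLong();
        synchronized (entry) {
            if (!entry.everChecked || now - entry.checkedAtMillis >= ttlMillis) {
                boolean ok;
                try {
                    ok = entry.probe.getAsBoolean();
                } catch (RuntimeException e) {
                    ok = false; // a failing probe counts as "service down"
                }
                entry.lastResult = ok;
                entry.checkedAtMillis = now;
                entry.everChecked = true;
            }
            return entry.lastResult;
        }
    }
}
```

In application code you would call something like registry.isAvailable("mail") before sending, and fall back (queue the message, skip the cache, and so on) when it returns false. The probe runs on the caller's thread in this sketch, so keep it fast; refreshing statuses from a background scheduler is a reasonable alternative if the probes are slow.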