High Availability of Your Cloud Expectations

by Ofir Nachmani
March 5, 2012

The Cloud Service Level Agreement (SLA) discussion puts penalties and compensation on the table. Can we say that the compensation the customer expects is the same as what the Software as a Service (SaaS) vendor’s SLA actually provides?

A while ago, I ran into problems while starting up a specific instance on the Amazon AWS cloud. I’m still not sure why, but the instance entered an endless restart loop. All the application deployment work (installation and configuration of a service) we had done on it over about two weeks went down the drain. Our discussion with the Amazon AWS support team ended with the support request being escalated to their head of support.

Take a look at the following paragraphs copied from the Amazon AWS EC2 SLA –

“Service Commitment: AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit as described below.”

“If the Annual Uptime Percentage for a customer drops below 99.95% for the Service Year, that customer is eligible to receive a Service Credit equal to 10% of their bill (excluding one-time payments made for Reserved Instances) for the Eligible Credit Period. To file a claim, a customer does not have to wait 365 days from the day they started using the service or 365 days from their last successful claim. A customer can file a claim any time their Annual Uptime Percentage over the trailing 365 days drops below 99.95%.”

By these terms, and given the fact that the instance never came back up, this was considered an exception to their commitment. The compensation we got from Amazon AWS for the loss was about $100 worth of service credit. Funny, isn’t it? I expected my service provider to take into account the actual loss, including the application installation and the time we invested in trying to save the day. The bottom line, however, was that Amazon expected me to take whatever actions were needed and to make sure that a backup was in place.
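To put the SLA terms quoted above into numbers, here is a minimal Python sketch. It assumes a simplified reading of the SLA: uptime is measured in plain minutes over a trailing 365-day window, and the credit is a flat 10% of the eligible bill. The function names and the $1,000 bill are hypothetical and chosen only for illustration; the real SLA defines unavailability and the Eligible Credit Period in more detail.

```python
# Simplified model of the quoted EC2 SLA terms (an assumption, not the
# SLA's exact measurement method).
MINUTES_PER_YEAR = 365 * 24 * 60  # trailing 365-day window, in minutes


def annual_uptime_percentage(downtime_minutes: float) -> float:
    """Uptime over the trailing 365 days, as a percentage."""
    return 100.0 * (MINUTES_PER_YEAR - downtime_minutes) / MINUTES_PER_YEAR


def allowed_downtime_minutes(commitment: float = 99.95) -> float:
    """Downtime per year that still meets the uptime commitment."""
    return MINUTES_PER_YEAR * (100.0 - commitment) / 100.0


def service_credit(bill: float, downtime_minutes: float,
                   commitment: float = 99.95,
                   credit_rate: float = 0.10) -> float:
    """Credit owed: 10% of the bill if uptime fell below the commitment."""
    if annual_uptime_percentage(downtime_minutes) < commitment:
        return bill * credit_rate
    return 0.0


if __name__ == "__main__":
    print(f"Allowed downtime: {allowed_downtime_minutes() / 60:.2f} hours/year")
    print(f"Credit on a hypothetical $1,000 bill after 300 minutes down: "
          f"${service_credit(1000.0, 300):.2f}")
```

Under these assumptions, 99.95% leaves room for a little over four hours of downtime a year, and the credit depends only on the size of the bill, not on what the outage actually cost the customer.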

After the famous Amazon cloud outage in April 2011, Amazon added the phrase “shared responsibility” to its basic mantra on HA (high availability) and security. The phrase refers to the responsibility that both the cloud vendor and the cloud consumer must take to keep everything up and running.

The April outage didn’t stop us (or Quora, Netflix and others) from continuing to use AWS services, however. We found that moving on-premises, or to another cloud vendor, would be costly and would run counter to our overall IT strategy (utilizing public cloud resources as part of the overall environment). Considering the perceived value of the cloud service and the cost of switching, most cloud consumers will need to compromise and keep using their service, certainly in the immediate term and probably in the long term as well. In the long run, the customer can re-evaluate the architecture and the cloud vendor’s performance with regard to HA. In today’s marketplace, any respectable cloud or service vendor must have a customer retention mechanism in place, and must be able to quickly fix problems and demonstrate better service, especially when it comes to availability.

While checking several other cloud providers’ SLAs, such as Rackspace’s, I found that some of them actually guarantee 100% availability, with service credit compensation in case of an outage. I find this suspicious, as there is clearly no real balance between the SLA commitment and the penalties these vendors defined for themselves. I believe that in these cases the uptime guarantee, as well as the SLA itself, becomes a marketing tool, a way of affirming the vendor’s commitment to its customers. The cloud vendor must have extensive experience in cloud management and must have monitoring tools in place (such as New Relic). That sounds trivial, but trust me, cloud vendors still struggle with it (see https://status.rackspace.com/ and you will see what I mean). The environment’s performance must be transparent to customers. It should be tracked and measured so that past performance, in matters such as HA, can be presented to the customer at all times. I believe that the vendor’s commitment should be driven by past metrics, and that it must be accurate and based on reality rather than on marketing.

Check out this great article from the NYTimes Magazine discussing the Gmail SLA guarantee:

“‘We don’t believe Five 9s is attainable in a commercial service, if measured correctly,’ says Urs Hölzle, senior vice president for operations at Google. The company’s goal for its major services is Four 9s. … Gmail has backup copies offline, but it normally uses two perfectly mirrored live copies — and that introduces the potential for trouble. Last year, Gmail’s availability was 99.984 percent. (This is the percentage of requested actions, such as sending off a message, that were successful.)”
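To get a feel for what these “nines” mean, the sketch below converts an availability percentage into the downtime it would allow over a year. It assumes availability is measured against wall-clock time over a 365-day year; note that the Gmail figure in the quote is a share of successful requests, so reading it as time is only a rough analogy. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year implied by a time-based availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return HOURS_PER_YEAR * 60 * unavailable_fraction


if __name__ == "__main__":
    figures = [
        ("Five 9s (99.999%)", 99.999),
        ("Four 9s (99.99%)", 99.99),
        ("Gmail, as quoted (99.984%)", 99.984),
        ("EC2 SLA commitment (99.95%)", 99.95),
        ("100% guarantee", 100.0),
    ]
    for label, pct in figures:
        print(f"{label}: {downtime_minutes_per_year(pct):.1f} minutes/year")
```

Seen this way, the gap between Four 9s and Five 9s is the difference between under an hour and a few minutes of downtime a year, which helps explain why a 100% guarantee reads more like marketing than like an engineering target.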

After the Amazon April outage, I wrote the post “Amazon Outage: Is it a story of a conspiracy?”, claiming that this major outage was a manipulation by Amazon AWS intended to reset its customers’ expectations. It was a tongue-in-cheek thought, prompted by the enormous number of articles that discussed the lack of transparency and customers’ uncertainty about the state of the environment. Taking the “service approach” from a cloud customer’s perspective, my real consideration in choosing a future cloud vendor will be whether or not the service I’m getting is monitored, measured, and improved on an ongoing basis. A trustworthy cloud vendor should offer public dashboards (such as the AWS service health dashboard) and a robust notification system, giving full transparency into the state of its cloud environment. The customer must know what the vendor expects at the start of the engagement and must be kept fully aware of the cloud environment’s state as things happen. Only when those standards are met should the vendor have the right to ask for this “shared responsibility.” The customer, as well as the vendor, must take responsibility and communicate in case of an outage. A happy cloud service provider is one that feels it’s not being judged and that has its customers’ trust. Happy cloud customers are those who feel in control and safe with their service provider.
