By Shiji Sujai, IOD Expert
I first came across a full-fledged DR Suite almost 10 years ago when the organization I was working for decided to implement a real-time DR…
Outages are inevitable. As we’ve seen over the past few years, every major cloud vendor has experienced at least one, and we can expect that they will again at some point in the future. As cloud consumers, we need to be able to use the cloud’s building blocks and unlimited resources (at least, in theory) to create service robustness and high availability. Yet important issues, like SLAs, remain unclear when it comes to consuming resources and services from IaaS vendors. Today, more than ever, online software service vendors have a lot to lose when their services suffer from performance degradation. They could lose significant amounts of revenue as a result of actual outages, as well as diminished user loyalty. In this article, I will share baseline perceptions and methods of cloud-based DR.
In April 2011, when Amazon’s US East cloud region failed, I posted the first chapter of the Amazon Cloud Outage Conspiracy. It was already very clear that the cloud would fail again, and here it is… Chapter 2
Let’s first try to understand Amazon’s explanation for this outage.
“At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power.”
A lot has already been said about the wrong way to use the cloud, where the IaaS platform is utilized as a hosting extension of the IT organization’s data center without taking advantage of its elasticity to build a cost-effective and scalable IT operation. Using a public IaaS, whether it is Amazon, Rackspace or any other vendor, means using a highly dynamic environment, which brings increasing complexity and hence a loss of control. Checking the list below, I can say that cloud control (across all its layers: IaaS, PaaS and SaaS) basically covers the same aspects as good old system management.
What is “system management”?
Traditionally, delivering high availability often meant replicating everything. Today, however, even with the option of going to the cloud, providing two of everything is costly. High availability should be planned and achieved at several different levels, including the software, the data center and geographic redundancy. According to a recent study, the cost of a data center outage ranges from a minimum of $38,969 to a maximum of $1,017,746 per organization, with an overall average cost of $505,502 per incident.
1 – Total cost of partial and complete outages can be a significant expense for organizations.
2 – Total cost of outages is systematically related to the duration of the outage.
3 – Total cost of outages is systematically related to the size of the data center.
4 – Certain causes of outages are more expensive than others. Specifically, IT equipment failure is the most expensive root cause. Accidental/human error is the least expensive.
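To make the trade-off between "replicating everything" and planning availability level by level more concrete, here is a minimal sketch of the standard availability arithmetic. It is only an illustration: the 99.9% figure per level and the three-level split (software, data center, region) are assumptions chosen for the example, not numbers from the study, and the model assumes failures are independent, which real outages often are not.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def parallel(availability: float, replicas: int) -> float:
    """Availability of `replicas` independent copies when one survivor is enough."""
    return 1 - (1 - availability) ** replicas


def serial(*levels: float) -> float:
    """Availability of a chain of levels that must all be up at the same time."""
    return prod(levels)


def downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures: each level (software, data center, region) is 99.9% available.
LEVEL = 0.999

# One copy of everything: the three levels are chained, so availability drops.
single = serial(LEVEL, LEVEL, LEVEL)

# Two copies at every level: each level is replicated, then the levels are chained.
redundant = serial(parallel(LEVEL, 2), parallel(LEVEL, 2), parallel(LEVEL, 2))

if __name__ == "__main__":
    print(f"single copy : {single:.6f} ({downtime_hours(single):.2f} h/year down)")
    print(f"two of each : {redundant:.6f} ({downtime_hours(redundant):.4f} h/year down)")
```

Under these assumptions, the sketch can be used to compare the downtime saved by each extra replica against what that replica costs, which is the kind of per-level planning described above.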
From an attacker’s perspective, cloud providers aggregate access to many victims’ data into a single point of entry. As cloud environments become more and more popular, they will increasingly become the focus of attacks. Some organizations think that liability can be outsourced, but I hope we all understand that it cannot. The contract with your cloud vendor basically means nothing: the ISVs, or should I say the `SaaS providers`, still hold the responsibility. So rather than focusing on contracts and limiting liability in cloud services deals, you should focus on controls and auditability.
Last week my Twitter feed lit up with news magazines and cloud bloggers reporting the extraordinary news: “The cloud computing crashed”. Amazon AWS had suffered a major outage in its US East facility. This was the worst outage in cloud computing’s and Amazon’s history. The failure affected major sites such as Heroku, Reddit, Foursquare, Quora and many more well-known internet services hosted on EC2. From what I read, it seems that automated processes began replicating a large number of EBS volumes, which harmed EBS performance and availability across multiple availability zones in the Northern Virginia region.
“..However badly they’ve been affected, providers have sung Amazon’s praises in recognition of how much it’s helped them run a powerful infrastructure at lower cost and effort.” Seven lessons to learn from Amazon’s outage (ZDNet SaaS Blog)