After reading about the AWS outage, it occurred to me that while much attention has been given to the effects, little discussion of the fundamental cloud principles at work has been offered. It seems to me that there are two important factors to consider when analyzing an event like the AWS outage:
- the tension between elasticity and fault isolation
- the rarity of rare events
I'll try to muse (as briefly as I can) on both topics.
The Elasticity Versus Fault Isolation Tradeoff
In some sense, a cloud must allow its users to have it both ways. On the one hand, users expect elasticity: the fluid and seamless acquisition of resources from anywhere in the cloud, without explicit regard for location or provenance. On the other, the cloud is expected to fence off faults and failures so that, when they occur, they do not propagate globally.
From an implementation perspective, designing a system that is globally seamless with respect to legitimate user-generated events but compartmentalized and isolating with respect to faulty ones is tricky. Anyone who has had to police a large installation for "spam bots" will understand immediately that the differences in system response when servicing healthy and unhealthy workloads can be subtle. I don't know what the root cause or causes of the AWS outage were, but I predict that once a post-mortem and diagnosis are complete, the design principles that will be most carefully scrutinized are those that deal with this tension between elasticity and fault isolation.
The Rarity of Rare Events
One doesn't need an elaborate probabilistic framework to understand rare events in cloud systems. For example, a "one in a million" failure event means "for every batch of one million operations that the cloud performs, there is one expected failure." This simple statement depends critically on a property known as independence, and whether independence actually holds for any given system is a matter of interesting architectural debate. At a high level, though, it is probably useful to think about cloud user requests as being susceptible to this kind of analysis.
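The arithmetic behind that statement can be made concrete in a few lines of Python. The numbers here are purely illustrative, and the calculation assumes the independence property just described:

```python
# "One in a million" per-operation failures, assuming each operation
# fails independently of the others (illustrative numbers, not AWS data).
p = 1e-6       # per-operation failure probability
n = 1_000_000  # operations in the batch

# Expected failures in the batch: p * n.
expected_failures = p * n
print(expected_failures)  # 1.0

# Under independence, the chance the batch sees *no* failure at all
# is (1 - p)^n, roughly e^-1: "one expected failure" does not mean
# a failure is guaranteed, only that one is the average.
p_no_failure = (1 - p) ** n
print(round(p_no_failure, 2))  # 0.37
```

Note that even with one expected failure per batch, about a third of batches would complete cleanly; the expectation is an average, not a schedule.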
The other day I had the opportunity to see Jeff Barr, noted AWS evangelist, speak about Amazon's cloud. He said that parts of the cloud are built with "eleven nines of reliability." If one considers how many operations AWS must perform internally every second, the Law of Large Numbers starts to have an effect. Put another way, even if the chances of failure are one in one hundred billion, every hundred billion operations or so there is going to be a failure. Considering the myriad components that almost assuredly make up AWS, the operation count per second (or minute or day or year) is staggering. The outage, then, as a rare event probably says as much about the popularity of AWS (measured by load) as it does about the scope of the effects.
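To see how the Law of Large Numbers bites, one can multiply a tiny per-operation failure probability by a plausible operation rate. The rate below is a hypothetical figure of my own, not an AWS number:

```python
# "Eleven nines": one failure per hundred billion operations.
p = 1e-11

# Hypothetical internal operation rate (illustrative, not an AWS figure).
ops_per_second = 10_000_000
seconds_per_day = 86_400

ops_per_day = ops_per_second * seconds_per_day  # 8.64e11 operations
expected_failures_per_day = p * ops_per_day
print(expected_failures_per_day)  # ~8.6 expected failures per day
```

Even at that astonishing reliability level, a busy enough system expects several failures a day somewhere in its machinery.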
Large Numbers, End-to-End
In the early 1980s, Saltzer, Reed, and Clark published a position paper discussing their views on robust system design. Known as the End-to-End Argument, it says, essentially, that the way to build the most robust systems is to build fault tolerance into the applications themselves. Most application and system builders today view the End-to-End Argument as a bit overly reductionist, pertaining most directly to extreme circumstances or needs: an important observation but not necessarily an engineering principle.
The recent AWS outage, however, points out an interesting consequence of cloud computing: the cloud becomes a victim of its own success. By supporting large user scale, the system also aggregates the risk of a rare failure event. That is, for a cloud, no matter how unlikely a failure is, there is a user scale at which the failure, in some sense, must occur. The user load may never reach that scale, but it exists, in theory, for every engineering design point.
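That aggregation of risk can be sketched directly: the probability that at least one of n users hits a probability-p failure is 1 - (1 - p)^n, which climbs toward certainty as n grows. The per-user probability below is an illustrative value of my own:

```python
# Probability that *someone* experiences a rare failure as user
# scale grows. p is an illustrative per-user failure probability.
p = 1e-9

for n in (10**6, 10**9, 10**11):
    prob_any = 1 - (1 - p) ** n
    print(f"n = {n:.0e}: P(at least one failure) = {prob_any:.4f}")
```

At a million users the failure is negligible; at a hundred billion user-requests it is, for practical purposes, certain. The failure rate never changed; only the scale did.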
Thus, in the extreme, the best solution to the problem of tolerating failure reduces to the End-to-End Argument: when continuous operation is a requirement, the application itself must include logic for managing failures, no matter how well engineered the system it runs on. Admittedly, this statement is extreme, but clouds like AWS are also extreme in the scale they can support, which makes such reductionist logic potentially useful. AWS is an extraordinarily well-designed and engineered system, as its availability characteristics indicate. There is just no getting around the Law of Large Numbers, and for AWS, one has to believe the numbers are large.
Steven Nelson-Smith posted a nice blog entry on engineering insights and practices that pertain to the AWS outage itself. This analysis demonstrates some of the reasoning that is necessary to build robustness into an application deployment in order to achieve survivability. Abstracting and then paraphrasing his excellent advice:
If you are using it, assume it will fail, and have a redundant alternative.
AWS is powerful because it provides the tools to give an application many redundant alternatives. The use of these resources, however, must be designed into every application that cannot tolerate downtime.
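A minimal sketch of what "designing in the redundant alternative" can look like; the endpoint names and the `fetch_from` function are hypothetical stand-ins, not a real AWS API:

```python
# Sketch of the "assume it will fail, have a redundant alternative"
# principle. fetch_from and the zone names are hypothetical stand-ins
# for real replicas (e.g. in different availability zones).

def fetch_from(endpoint):
    """Stand-in for a real service call; pretend zone-a is down."""
    if endpoint == "zone-a":
        raise ConnectionError(f"{endpoint} unavailable")
    return f"data from {endpoint}"

def fetch_with_failover(endpoints):
    """Try each redundant alternative in turn; give up only if all fail."""
    errors = []
    for endpoint in endpoints:
        try:
            return fetch_from(endpoint)
        except ConnectionError as exc:
            errors.append(exc)  # record the failure, fall through to the next
    raise RuntimeError(f"all alternatives failed: {errors}")

print(fetch_with_failover(["zone-a", "zone-b", "zone-c"]))  # data from zone-b
```

The point is not the six lines of failover logic but where they live: in the application, exactly as the End-to-End Argument prescribes.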
Finally, I find it interesting that a new and exciting technology such as cloud computing has made rather well-worn principles such as the End-to-End argument and even the Law of Large Numbers useful tools for understanding the power and the consequences of this technology's development.