Eucalyptus 3 includes, as one of its new features, the option of enabling "High Availability" or "HA" (pronounced as the two letters, not as the first part of a chuckle). This functionality was overwhelmingly the most frequently requested improvement over Eucalyptus 2 from the community and, at the same time, the one over which the most confusion seems to persist. I'll try to shed some light on the "what," "why," and "how" of HA in Eucalyptus.
What: Automatic Failover and Recovery of Eucalyptus Services
Eucalyptus consists of a number of "always-on" interacting web services. In an HA configuration, each service can be configured with a running-but-inactive spare. If and when an active component fails or becomes disconnected from the other Eucalyptus services, the system will enable the spare on the fly and keep running. In addition, when the failed or disconnected component is restored to full functionality, Eucalyptus will automatically re-incorporate it into the running system and re-establish the automatic failover capability for that component.
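The failover behavior described above can be sketched in miniature. What follows is a conceptual illustration only, not Eucalyptus's actual implementation: the class and function names, the heartbeat mechanism, and the five-second timeout are all assumptions made for the sake of the example.

```python
import time

class Service:
    """A service endpoint that is either 'active' or a running-but-inactive 'spare'."""
    def __init__(self, name, role):
        self.name = name
        self.role = role                       # "active" or "spare"
        self.last_heartbeat = time.monotonic() # updated each time the service checks in

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

def monitor(pair, timeout=5.0, now=None):
    """Promote the spare if the active member has missed its heartbeat window.

    Returns the (possibly newly promoted) active service. The timeout is
    illustrative; a real system tunes it against network and scheduling jitter.
    """
    now = time.monotonic() if now is None else now
    active = next(s for s in pair if s.role == "active")
    spare = next(s for s in pair if s.role == "spare")
    if now - active.last_heartbeat > timeout:
        # Fail over on the fly: swap roles so the spare takes over.
        active.role, spare.role = "spare", "active"
        return spare
    return active
```

When the failed component is later restored, it would re-register and simply assume the spare role, which is how the automatic failover capability is re-established.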
There are six Eucalyptus-internal service components:
- The Cloud Controller (CLC)
  - The CLC handles credentials, all request management except for requests to Walrus, and VM state management.
- Walrus
  - Walrus implements Put/Get object storage with append-only semantics and eventual consistency (similar to AWS S3).
- The Cluster Controller (CC)
  - The CC manages IP addresses, network provisioning (Layer 2 and Layer 3), and VM inventory and scheduling for a single availability zone.
- The Storage Controller (SC)
  - The SC implements Eucalyptus Block Storage (EBS) — dynamic storage volume management — for a single availability zone.
- The VMware Broker (VMWb)
  - The VMWb acts as a proxy for a VMware installation (ESX-based or vSphere) so that Eucalyptus commands can be actuated in a VMware virtualization environment.
- The Node Controller (NC)
  - The NC actualizes VM and volume actions on a specific hypervisor (other than the VMware hypervisors).
HA Eucalyptus implements active-spare pairings for all components except NCs. As a result, the platform itself is resilient to the unavailability or failure of the internal services that implement it.
The Point of Failure
Jared Wray, CTO of Tier 3, recently offered a succinct and cogent set of Four Golden Rules for High Availability. Rule #1 is "Thou shalt have no single point of failure." It sounds obvious, but for a cloud platform, realizing this simple maxim can be challenging to understand and implement.
For example, the "point" in "single point of failure" typically refers to a "machine" or "server." Thus, eliminating a single point of failure in HA Eucalyptus often refers to ensuring that a single machine failure will not cause the cloud to "fail stop" — suspend its ability to service new requests. This form of HA is relatively straightforward to implement if each Eucalyptus component is assigned to a separate machine and the network is not considered a resource that can fail.
However, Eucalyptus service components are designed so that they can be deployed in a wide variety of tenancy configurations. Many small to medium sized deployments, for example, put the CLC, Walrus, the CC, and the SC on a single "head" node that controls NCs that actuate a collection of worker nodes. This configuration is advantageous in that it minimizes the hardware footprint (all control components share a single server). The analogous HA configuration puts the hot spares for the CLC, Walrus, the CC and the SC on a second, backup head node so that if the primary head node fails, the secondary can activate in its place.
Making that configuration work in HA, however, requires that Eucalyptus manage four simultaneous failures — not just one. From the perspective of the system's internals, each component is logically distinct. Each one is implemented so that it communicates with the others using WSDL-described web service interfaces. Regardless of whether they are co-located on a single server, they appear to each other as separate autonomous services. Thus, to support HA in this minimum hardware configuration, the system must be able to manage multiple simultaneous failures internally.
Put in slightly more formal terms, Eucalyptus (and any distributed system that allows flexible deployment) must automatically manage "correlated" failure conditions that are induced by a specific deployment. The author of the code does not know in advance how the components will be deployed and where failures will be correlated as a result of the deployment.
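To make the idea of deployment-induced correlation concrete, here is a small sketch, with hypothetical host and component labels (this is not Eucalyptus code), that derives the correlated-failure domains of a deployment simply by grouping components by the host they share:

```python
from collections import defaultdict

# Hypothetical "single head node" deployment: four control components share
# one server, with their hot spares together on a second, backup head node.
deployment = {
    "CLC": "head-1", "Walrus": "head-1", "CC": "head-1", "SC": "head-1",
    "CLC-spare": "head-2", "Walrus-spare": "head-2",
    "CC-spare": "head-2", "SC-spare": "head-2",
}

def failure_domains(deployment):
    """Group components by host: the loss of one host is a correlated,
    simultaneous failure of every component in its group."""
    domains = defaultdict(set)
    for component, host in deployment.items():
        domains[host].add(component)
    return dict(domains)
```

For the deployment above, losing "head-1" takes out the CLC, Walrus, the CC, and the SC at once, which is exactly the four-simultaneous-failure case the HA machinery has to handle.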
One place where correlated failure is often overlooked is in the network. Many deployment architectures account for machine and storage redundancy but rely on multi-tenancy in the network fabric. Correlated failures caused by network outages are particularly difficult to manage automatically, so great care must go into the HA design if the system is to remain available when network components fail.
High Availability versus Fault Tolerance
We often engage in what is admittedly a potentially philosophical debate over the difference between the meaning of the terms High Availability and Fault Tolerance. This debate is certainly blasphemous in some academic circles but for the purposes of understanding HA in Eucalyptus, a little blasphemous philosophy is necessary.
High Availability refers to the notion that Eucalyptus as a collection of services will remain available in the presence of one or more failures. Individual requests for Eucalyptus services may fail, but the system itself does not experience an outage that is long-term and/or requires manual intervention to repair. Note that technically Eucalyptus 2 without the HA features in Eucalyptus 3 conforms to this definition of HA (as do many other cloud platforms) for some of its services. The loss of a CC, for example, does not prevent the system from servicing requests to other availability zones or Walrus. What is key in Eucalyptus 3 is that the system's full set of services can be deployed in a way that ensures this HA property.
Fault Tolerance refers to the notion that the service masks failures entirely when they occur. A fully fault-tolerant version of Eucalyptus would not only remain viable and active in the presence of failure, but all requests underway when a failure occurs would be unaffected by the failure.
Generally speaking, fault tolerance is a stronger and more difficult requirement to fulfill than high availability. Distributed fault-tolerant systems are notoriously difficult to design and implement, particularly at scale, and where they have been built (e.g., in the banking industry) they have been expensive. For these reasons, cloud computing, web 2.0 before it, and e-commerce before that have embraced the utility of high availability as sufficient for production deployment.
Under these definitions, then, Eucalyptus attempts both to make its services "highly available" and to ensure that the users experience whatever "fault tolerance" capabilities can be supported by the infrastructure itself. For example, when a machine or network link fails, Eucalyptus in an HA configuration will both fail over its internal components and attempt to reroute network traffic so that the VMs that are running in the cloud do not experience a loss of network connectivity. If Eucalyptus were merely "highly available" with respect to VM connectivity, dropping the TCP connections to active VMs would be permissible as long as they could be re-established if and when they are retried by the application.
We chose this combination of "high-availability" and "fault-tolerance" largely due to the community and customer requirements we have observed in production deployments. While cloud users and applications exercising the cloud APIs are prepared for requests to fail, legacy application code, often developed for a single-machine environment, is frequently not written to retry in the face of service interruption. Thus, with Eucalyptus 3, users experience HA but VMs experience (to the extent possible) fault-tolerant external services (network connectivity and external storage).
However, Eucalyptus does not yet implement fault-tolerance in the VMs themselves. Many services such as RightScale and enStratus implement a form of HA for VMs with Eucalyptus by simply relaunching VMs that no longer respond to a heartbeat. Thus VM HA is supported both by Eucalyptus 2 and Eucalyptus 3. VM fault-tolerance, however, would require that a new VM "pick up where it left off" at the exact moment of the failure of an existing VM as if no failure had occurred. A full implementation of this type of capability may be possible with hypervisor support, but at present, it is not generally available.
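The heartbeat-and-relaunch form of VM HA that services like RightScale and enStratus provide can be sketched as a simple reconcile loop. This illustrates the pattern only; the function names and probe interface are invented, and, as the paragraph above notes, the replacement VM boots fresh rather than resuming the lost VM's state:

```python
def reconcile(desired_vms, is_alive, relaunch):
    """Relaunch any VM in `desired_vms` whose heartbeat probe fails.

    `is_alive(vm)` returns True if the VM answers its heartbeat;
    `relaunch(vm)` starts a fresh replacement instance. This restores
    availability (HA) but not in-flight state (fault tolerance).
    Returns the list of VMs that were replaced.
    """
    replaced = []
    for vm in desired_vms:
        if not is_alive(vm):
            relaunch(vm)
            replaced.append(vm)
    return replaced
```

A scheduler would run such a loop periodically; anything that stops answering its heartbeat simply gets a new instance, which is why this works equally well against Eucalyptus 2 and Eucalyptus 3.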
Enough Philosophy — How does it work?
I'll try to outline the technical details associated with the Eucalyptus 3 implementation of HA in my next posting. Even for those with a technically robust constitution, the specifics can cause blurred vision and headaches. For now, the answer to the question "How does it work?" is, at a high level, "pretty well." Fail-over times depend on deployment configuration and the performance of the underlying hardware, but they are measured in minutes. Repair time (the time to get the system back into an HA configuration after a failure has been cleared) can actually take longer, as the system needs to restore its distributed state, and doing so is one of the few places where eventual consistency is inadequate. Restore times, though, are still minutes and not hours, and VM connectivity is unaffected.
In closing this entry, I'll point out one curious feature of the Eucalyptus HA design: the "primary-backup" or "master-slave" relationship between redundant components is decided by the software, not by the administrator or user. That is, in Eucalyptus 3.0 (the current release with HA), the administrator tells the system where it may place a primary-backup pair, but the system "decides" which is the primary and which is the backup. Furthermore, due to network slowness, Linux scheduling vagaries, etc., the system may decide, at some point, to swap the primary and backup roles for its internal components automatically and without warning.
Eucalyptus 3 does provide a way for the administrator to manually reconfigure the primary and backup mappings through an admin API, but because the system attempts to manage failures automatically, there is not, at present, a guarantee that it will respect any particular configuration if it suspects a fail-over is necessary.
A future version of Eucalyptus 3 will include a "preference" option that tells the HA controllers to choose a specific assignment of primary and backup if it can. However, we were amused to learn that the HA features in the current Eucalyptus implementation make decisions on their own about how the system should manage itself internally. It isn't exactly SkyNet from "The Terminator," but we are keeping an eye on the deployments we have running in house all the same.
Next time — HA internals. Bye for now.