By Rich Wolski | June 04, 2012
In the last post on HA I tried to provide some of my perspective on the "what" and "why" of the High Availability (HA) feature in Eucalyptus. This follow-up post is intended to focus on the "how," and, as a result, delves into some of the implementation details, albeit at the architectural level.
There are two important architectural features of Eucalyptus that influence its implementation of HA. Eucalyptus consists of individual service components, each of which "plugs" into a system-wide communication and synchronization substrate. The two features are:
- Service Bus Architecture — Eucalyptus implements an Enterprise Service Bus to integrate components, and
- Service Location Independence — The interconnection topology interlinking service components is not visible to the components themselves.
The result of these two characteristics is that HA must be implemented at the component level in a way that does not depend on how the components are deployed across a specific infrastructure topology. Eucalyptus extends the service bus transparently across process and machine boundaries so that service components can be assigned to machines more or less independently.
The Eucalyptus Service Bus (ESB) is implemented in Java using Mule (which is an absolutely amazing piece of software — highly recommended) as its core technology. All service components except the Cluster Controller (CC) and Node Controller (NC) are written in Java and execute in the same JVM when co-located on a single machine. The CC and NC are web services written in C. There is an adapter that proxies the CC onto the ESB for the purposes of HA. The NC is not proxied onto the ESB, however, since it is logically an agent for the VM and is not implemented for HA (Eucalyptus 3 does not implement HA for VMs). Thus, the HA implementation must also be able to tolerate three separate component failure events:
- individual component failure: a single software component has failed or become unresponsive
- Java Virtual Machine failure: the JVM containing co-located components fails causing all of them to fail simultaneously, and
- machine failure: the machine hosting Java components and/or a CC fails causing a correlated JVM and/or CC failure.
Because the combinatorial space of possible co-locations of service components is large, Eucalyptus must implement HA at the component level and not at the Linux-cluster or JVM level. One question that is often asked is "Why doesn't Eucalyptus use some form of HA Linux?" Part of the answer is that HA Linux handles machine failure (the third case) but not the other two. It could be used as part of the solution, but an additional mechanism would be needed to cover the two finer-grained failure cases.
Network Failure can Produce a Splitting Headache
Network failures are another source of "correlated" failure that Eucalyptus recognizes and manages. The Eucalyptus control plane consists of two separate logical networks: the service bus interconnecting all front-end control components (CLCs, Walruses, CCs, and SCs), and the actualization network consisting of point-to-point connectivity between each NC and the CC that manages the cluster within which it is located. For descriptions of these service components please see the previous blog post. Summarizing here, though, HA implements "hot failover" on the service bus so that the failure or inoperability of any service component will not cause the cloud to halt.
However, sometimes the problem is not that a component has failed, but rather that it has failed to fail when the network to which it is attached has become faulty. For example, in Eucalyptus, the Cloud Controller (CLC) manages all requests pertaining to user credentials. Consider the failure condition in which the active CLC suddenly becomes disconnected from the other Eucalyptus components, but not from the users. The back-up CLC will become active and "take over," but the original CLC (now disconnected) is still "talking" to users. Thus a request might come in to one CLC to delete a user and another to create a user with the same name. What happens if the create occurs before the delete? Because both are active and still functioning, they can take contradictory actions that are uncoordinated.
This condition is typically called a "split brain" condition and it occurs when services capable of changing the state of the system are both functional but the communication between them is not (e.g. the network has experienced a "network partition"). Avoiding a split brain condition is tricky. Distributed consensus protocols such as two-phase commit and three-phase commit are examples of transaction protocols designed to allow collections of collaborating processes to avoid split brain confusion. They can have unpleasant performance properties, however, if not implemented and then used with a great deal of care.
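The shape of such a transaction protocol can be sketched in a few lines of Java. This is a generic illustration of two-phase commit, not Eucalyptus code; the `Participant` interface and all names below are invented for the sketch.

```java
import java.util.List;

// Minimal sketch of a two-phase commit coordinator, for illustration only.
public class TwoPhaseCommit {
    // A participant votes on whether it can commit, then applies the decision.
    public interface Participant {
        boolean prepare();   // phase 1: vote YES (true) or NO (false)
        void commit();       // phase 2a: make the change durable
        void abort();        // phase 2b: roll the change back
    }

    // Returns true iff the transaction committed on all participants.
    public static boolean run(List<? extends Participant> participants) {
        // Phase 1: collect votes. A single NO vote (or an unreachable
        // participant, modeled here as a thrown exception) aborts everyone.
        boolean allYes = true;
        for (Participant p : participants) {
            try {
                if (!p.prepare()) { allYes = false; break; }
            } catch (RuntimeException unreachable) {
                allYes = false;
                break;
            }
        }
        // Phase 2: broadcast the single, agreed-upon decision.
        // (In this sketch every participant hears the decision, even
        // ones that were never asked to prepare.)
        for (Participant p : participants) {
            if (allYes) p.commit(); else p.abort();
        }
        return allYes;
    }
}
```

The split-brain hazard the post describes is exactly what the single phase-2 decision prevents: no participant applies a change unless every participant agreed it could.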
Eucalyptus HA Architecture
The Eucalyptus back-end components (the CC and NC) are stateless: they periodically refresh the information they require from the machines they run on and from the other components. Thus, when one of these components activates, it begins to function without the need to synchronize with the rest of the system.
The other components (CLC, Walrus, SC and VMWare Broker) maintain object state using Hibernate to implement persistence in a backing SQL database. Using Hibernate alone could result in a stateless implementation, but each object access (either read or write) would require a multi-row database transaction. To avoid the potential performance penalty associated with accessing the database every time the Java objects are accessed, the Eucalyptus component stack implements a couple of layers of "write-through" object caching in memory. The caches improve both request throughput and response time dramatically, but they add an additional layer of complexity to the implementation of HA.
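A write-through object cache of this kind can be sketched schematically in Java. The class and the map standing in for the Hibernate/SQL backing store are illustrative inventions, not Eucalyptus's actual code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative write-through object cache: reads are served from memory
// when possible; writes go to the backing store first, then the cache.
// The "database" here is just a map standing in for Hibernate/SQL.
public class WriteThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, V> database;   // stand-in for the persistent store

    public WriteThroughCache(Map<K, V> database) { this.database = database; }

    public V get(K key) {
        // Cache hit avoids a database round trip entirely.
        V v = cache.get(key);
        if (v != null) return v;
        // Cache miss: load from the store and populate the cache.
        v = database.get(key);
        if (v != null) cache.put(key, v);
        return v;
    }

    public void put(K key, V value) {
        // "Write-through": persist first, so the store is never behind
        // the cache, then update the in-memory copy.
        database.put(key, value);
        cache.put(key, value);
    }
}
```

Because every write lands in the store before it lands in the cache, a fail-over never loses committed state — the cost of failure is only the cold cache, which is the problem the next paragraph takes up.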
Specifically, consider what happens when a Java component fails, losing its local object cache. The persistent data necessary to instantiate a replacement component is in the database. If the system is busy during the failure, though, it could be many minutes before the back-up component has reconstituted its objects. If the back-up runs directly from the database, then the system throughput and response time could be seriously degraded immediately following a fail over since the in-memory caches are not populated.
To solve this problem, the Eucalyptus component stack uses a distributed L2 cache that keeps the object caches on the primary and back-up components synchronized. The cache (currently implemented with TreeCache) uses JGroups to implement distributed consistency in the presence of failures.
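The effect of the distributed cache can be sketched very schematically. The real implementation uses TreeCache over JGroups group communication; the toy below replaces that machinery with direct method calls between peers, purely to show why the back-up's cache is warm at fail-over time:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Schematic replicated cache: every write on the primary is also pushed
// to registered peers, so a back-up's cache is warm when it takes over.
// (Eucalyptus uses TreeCache/JGroups; this sketch uses direct calls.)
public class ReplicatedCache {
    private final Map<String, String> entries = new ConcurrentHashMap<>();
    private final List<ReplicatedCache> peers = new ArrayList<>();

    public void addPeer(ReplicatedCache peer) { peers.add(peer); }

    public void put(String key, String value) {
        entries.put(key, value);
        // Propagate to the back-up(s) so their caches stay synchronized.
        for (ReplicatedCache p : peers) p.applyReplica(key, value);
    }

    // Invoked on a peer: update local state without re-replicating.
    void applyReplica(String key, String value) { entries.put(key, value); }

    public String get(String key) { return entries.get(key); }
}
```

When the primary fails, the back-up serves its first requests from memory rather than paging every object in from the database — which is precisely the fail-over performance problem described above.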
The distributed cache consistency protocol allows Eucalyptus to use separate databases rather than a single, replicated database in an HA configuration. The following figure shows the HA component stack organization.
There are two benefits to this implementation approach: HA at the service level and scalable bootstrapping.
HA Implemented at the Service Level
The first is that it works at the web service level. The component stack implements all Java components in the system regardless of their deployment location (either machine or JVM). Thus an HA deployment and a non-HA deployment differ only in how Eucalyptus is configured, not in what software components are configured into it or where those components are deployed. That is,
in Eucalyptus 3, HA is carried in the DNA of the implementation.
The only difference between an HA deployment and a non-HA deployment is whether "back-up" components have been registered with the cloud.
A second benefit of this approach is that it solves a subtle scale problem. If you have read this far, and you are a Eucalyptus skeptic (as I am from time to time), you are probably saying to yourself "Eucalyptus could use some combination of Linux-HA, database clusters, Pacemaker, Corosync, etc. to implement HA in a componentwise fashion." That is true. If all combinations of component failures can be covered using a GMS (group management service), replicated databases, and heartbeat, then it is possible to implement HA.
However a componentwise approach typically relies on a bootstrapping mechanism (a mechanism for starting the cloud in a way that is both scalable and secure) that requires the system to go from "not HA" to "HA" at a specific moment. For example, if multiple separate HA technologies are employed, they each need to start and achieve HA execution before they can be stitched together to form a system. The database must be up and HA, the message layer must be up and HA, web server stacks must be up and HA, etc. and then the system can compose itself from these separate HA components.
At scale, under load, the moments when everything is up and functioning properly, so that the system can successfully initialize, may be few and far between. That is, in a distributed system, there is likely a scale at which a failure is always occurring somewhere, making it impossible to synchronize separate HA components.
Eucalyptus, because HA is built into the component fabric, bootstraps incrementally regardless of the scale at which it is deployed.
That is, as soon as the object persistence layer has a primary and a back-up, the system is booted. As more components are added, it continues to function, growing incrementally even while failures are occurring. Put another way, except for the moment when the primary and back-up for the persistence layer boot, Eucalyptus does not need to wait for a "failure-free" synchronization moment to stitch itself together. Thus, its HA bootstrap mechanism scales.
Is the User Highly Available?
One interesting wrinkle is that HA Eucalyptus includes a new service that attempts to determine whether the user has failed. More accurately, the group membership and heartbeat protocols work between active Eucalyptus service components to detect partial network failures within the Eucalyptus cloud itself. What happens, though, if the network between the cloud and the user experiences a partial failure? Logically, the user (or the user's "terminal" — iPhone, laptop, deskside, etc.) is a service component that sends and receives messages. The network connecting it to the CLC or Walrus services can also experience a partial failure.
Eucalyptus now includes an arbitrator service that proxies the user onto the ESB. Because the interfaces use either REST or SOAP, it is not possible to "ping" the user to determine whether the network route between the user and the cloud is functional. The Arbitrator can be configured to ping a set of "distant" IP addresses, however, that act as a proxy for the user in terms of connectivity.
For example, if a Eucalyptus installation can ping "Google" then its administrators might decide that the externally facing network is "up."
This feature is useful in enterprise settings where there are redundant network routes available to the Eucalyptus cloud. If one of the routes fails, the Arbitrator will cause the primary CLC and/or Walrus service component using that route to fail over to its respective back-up. A deployment wishing to use this feature must take care to put primary and back-up service components in network locations that can be served by multiple routes, and to configure the Arbitrator to ping the right set of IP addresses. Notice, however, that treating the user as a service component "falls out" of the general HA architecture — HA now truly is in the DNA of Eucalyptus.
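An Arbitrator-style reachability check can be sketched roughly as follows. The decision rule (declare the route "up" if any of a configured set of distant addresses answers) is my simplification of the idea, not the Arbitrator's documented behavior, and the probe addresses are placeholders:

```java
import java.net.InetAddress;
import java.util.List;
import java.util.function.Predicate;

// Sketch of an Arbitrator-style reachability check: probe a configured
// set of "distant" addresses that stand in for the user's side of the
// network. The any-success-means-up rule is a simplification.
public class ReachabilityCheck {
    // Declare the external route "up" if any probe address answers.
    public static boolean routeIsUp(List<String> addrs, Predicate<String> ping) {
        for (String a : addrs) {
            if (ping.test(a)) return true;
        }
        return false;
    }

    // A real probe via the JDK (may require privileges and live network
    // access, which is why routeIsUp accepts an injectable ping function).
    public static boolean icmpPing(String addr, int timeoutMs) {
        try {
            return InetAddress.getByName(addr).isReachable(timeoutMs);
        } catch (Exception e) {
            return false;
        }
    }
}
```

Injecting the ping function keeps the fail-over decision logic testable without a live network; in production one would pass `a -> ReachabilityCheck.icmpPing(a, 2000)` with the administrator's configured address set.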
Beyond Eucalyptus 3 HA
In closing, I want to emphasize that, like all things Eucalyptus, the HA implementation is the first in what will surely be an evolution of implementations. We built HA into Eucalyptus 3 to provide a "first look" at what portable, scalable HA might "do" for a private cloud on behalf of the users and customers who really need it to run Eucalyptus as critical infrastructure. There are a huge (literally combinatorially huge) number of ways this first implementation can be extended. In particular:
- The C web service stack could be re-written and made HA.
- The current implementation is "primary/back-up" and does no load balancing. A multi-master approach with load balancing is possible.
- Some of the storage systems (Walrus and the SC) rely on HA functions in their backing store (e.g. DRBD for Walrus).
- The GMS implementation works well in a data center. It is not clear whether the architecture and/or implementation will extend to the wide area, where networks are slower and lossier.
These challenges make working on Eucalyptus exciting (at least, to us) — portable, production-ready, open-source distributed systems are a recent development. If you find this work interesting too, then please help us make the next version even better. We are working on developer documentation (and, frankly, documentation in general), and we will be working to make the current code base more readable. In the meantime, the Eucalyptus 3.1 code base is now on GitHub. Please jump in — the water (which is not highly available) is a little choppy but otherwise just fine.