
HA Planning

High availability results from combining the functionality that Eucalyptus provides with the environmental and operational support needed to keep the system operating properly. Eucalyptus provides functionality aimed at enabling highly available deployments:
  1. Detection of hardware and network faults which impact system availability: Availability of the system is determined by its ability to properly service a user request at a given time. The system is available when there is at least one set of functioning services able to perform the operations which result from a user request (that is, the system is distributed, and operations require orchestration involving some, possibly all, of its services).
  2. Deployment of redundant services to accommodate host failure: A failure is the observed consequence of an underlying fault which compromises the system's function in some way (possibly compromising availability).
  3. Automated recovery from individual component failure: Eucalyptus can take advantage of redundant host and network resources to accommodate singular failures while preserving the system's overall availability. As a result, the deployment of the system plays a large role in the level of availability that can be achieved.
To deliver services with high availability, Eucalyptus depends upon redundant hardware and network.
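
The definition of availability above can be illustrated with a small model. The following is a minimal sketch and not Eucalyptus code: the component and host names and the functions is_available and survives_any_single_host_failure are hypothetical, and the sketch assumes, for simplicity, that a user request needs at least one healthy instance of every component type.

```python
from typing import Dict, List, Set

# Hypothetical deployment map: component type -> hosts running that component.
# Names are illustrative only and do not come from a real Eucalyptus config.
Deployment = Dict[str, List[str]]


def is_available(deployment: Deployment, failed_hosts: Set[str]) -> bool:
    """True if every component type still has at least one healthy host."""
    return all(
        any(host not in failed_hosts for host in hosts)
        for hosts in deployment.values()
    )


def survives_any_single_host_failure(deployment: Deployment) -> bool:
    """True if the loss of any one host leaves the system available."""
    all_hosts = {host for hosts in deployment.values() for host in hosts}
    return all(is_available(deployment, {host}) for host in all_hosts)


redundant = {
    "cloud": ["host-a", "host-b"],
    "walrus": ["host-a", "host-b"],
    "cluster-controller": ["host-c", "host-d"],
    "storage-controller": ["host-c", "host-d"],
}
single = {"cloud": ["host-a"], "walrus": ["host-a"]}

print(survives_any_single_host_failure(redundant))  # True
print(survives_any_single_host_failure(single))     # False
```

In this model, a deployment with one host per component loses availability on the first host failure, while a deployment with each component on two hosts tolerates the loss of any one host.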

Considerations

A highly available deployment is able to mitigate the impact on system availability of faults from the following sources:
  • Machines hosting Eucalyptus services: Hardware faults on machines hosting Eucalyptus services can result in component services being unavailable for use by the system or users. The system monitors the state of each hosting machine to determine whether that machine can contribute to the work done. In support of high availability, you can configure redundant component services. With redundant component services, Eucalyptus can isolate and mask a component's failure.
  • Inter-component networks: Faults in the networks that connect the system's components to each other can prevent access to cloud resources and restrict the system's ability to process user requests. First, internal resources may become unavailable. For example, a single network outage could impact access to attached volumes or prevent access to running instances. Second, the coordination of services needed to process user requests may be impeded even if the service state is otherwise healthy.
  • User-facing network connections: User-facing network faults can prevent access to an otherwise properly functioning system. Whether a user can reach the system is difficult to determine from the system's perspective, because the system cannot observe the network from the user's point of view. Allowing for multiple inbound paths (for example, multiple disjoint routes) decreases the possibility of an availability-impacting outage occurring within the environment in which Eucalyptus is deployed. (See also: registering arbitrators)

Recommendations

To ensure availability in the face of any single failure, we recommend the following deployment strategy:
  • Host/Service Redundancy: Each registered component should have a complementary service registered on a redundant host. For example, the cloud and walrus services should be installed and registered on two hosts. Likewise, each partition should have two cluster controllers and two storage controllers (and two VMware Brokers, if VMware is being used) configured. Each such complementary pair of services can suffer a single outage before system availability is compromised.
  • Inter-component Network Redundancy: Each host of a component service should have redundant and disjoint network connections to other internal component services and supporting systems (for example, SANs, vSphere). The recommended approach is to install two Ethernet devices on each host, connect each to a disjoint layer-2 network, and bond the devices. Such a configuration is also suggested on node controllers. With this configuration, the outage of either layer-2 network or of either Ethernet device on a host does not impact service availability or access to cloud resources.
  • User-facing Network Redundancy: The wide area network connection (where users are) should be redundant and disjoint. Each such path should have an independent arbitrator host whose liveness (as determined by ICMP echo) is used to approximate the users' ability to access the system. This combines redundant network connections from the local area network to the wide area network with user reachability approximation through arbitrators.
  • System Reachability Approximation: Each wide area network path (where users are) should have an independent host (arbitrator) whose liveness (as determined by ICMP echo) can serve as a reasonable approximation of users' ability to access the system. Ideally, the arbitrator is the host “closest” to the user that is still within the domain of the deployment environment (for example, the border gateway of the hosting AS network). With such an arbitrator host in the network path between the user and the system, a failure by the user to reach an otherwise working service can be detected, allowing the system to enable the complementary service (which should have a separate network route) and restore user access.
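
The arbitrator idea can be sketched in a few lines. This is an illustration under stated assumptions and not the mechanism Eucalyptus itself uses: the service names, the function names icmp_echo and choose_active_service, and the selection rule are all hypothetical. The icmp_echo helper assumes a Linux iputils-style ping command, where -c is the packet count and -W is the reply timeout in seconds.

```python
import subprocess
from typing import Callable, Dict, Optional


def icmp_echo(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request using the system ping command.

    Assumes a Linux iputils-style ping where -c is the packet count and
    -W is the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def choose_active_service(
    arbitrators: Dict[str, str],
    current: str,
    probe: Callable[[str], bool] = icmp_echo,
) -> Optional[str]:
    """Pick which service of a redundant pair should be serving users.

    arbitrators maps a (hypothetical) service name to its arbitrator host.
    The current service is kept while its arbitrator answers; otherwise the
    first complementary service with a reachable arbitrator is chosen.
    Returns None if no arbitrator answers.
    """
    if probe(arbitrators[current]):
        return current
    for service, arbitrator in arbitrators.items():
        if service != current and probe(arbitrator):
            return service
    return None


# Usage with a stubbed probe so the example runs without network access.
# The addresses come from the documentation ranges reserved by RFC 5737.
pair = {"cloud-primary": "192.0.2.1", "cloud-secondary": "198.51.100.1"}
reachable = {"198.51.100.1"}
print(choose_active_service(pair, "cloud-primary", probe=lambda h: h in reachable))
# prints: cloud-secondary
```

The probe is passed in as a parameter so that the decision logic can be exercised without sending real ICMP traffic. An unreachable arbitrator is treated as evidence that users on that path cannot reach the service, which is only an approximation, as noted above.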

SAN and Multipathing

Multipathing is a way to make the data path from the NC or SC to your SAN device highly available. Multipathing does this by giving the host two network paths that both lead to the same data volume. This allows the host to switch from one network path to the other if one path becomes unavailable. Essentially, multipathing decreases the likelihood that a volume will become unreachable from a host (NC). For information about configuring your SAN for multipathing, see Configure the Storage Controller.
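
The path-switching behavior can be pictured with a toy model. This is a conceptual sketch only: real multipathing is handled by the operating system's storage stack, not by application code, and the class MultipathVolume, its methods, and the path names are hypothetical.

```python
from typing import Callable, List


class MultipathVolume:
    """Toy model of a volume reachable over several network paths."""

    def __init__(self, paths: List[str], path_is_up: Callable[[str], bool]):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = paths
        self.path_is_up = path_is_up
        self.active = paths[0]

    def read(self) -> str:
        """Use the active path, switching to another path if it is down."""
        if not self.path_is_up(self.active):
            for candidate in self.paths:
                if self.path_is_up(candidate):
                    self.active = candidate
                    break
            else:
                # No path answered, so the volume cannot be reached at all.
                raise IOError("volume unreachable: all paths are down")
        return f"read via {self.active}"


down = {"path-a"}
volume = MultipathVolume(["path-a", "path-b"], lambda p: p not in down)
print(volume.read())  # read via path-b
```

With two paths, the volume becomes unreachable only when both paths are down at the same time, which is why a second path lowers the likelihood of losing access.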