High availability is the result of the combination of functionality
provided by Eucalyptus and the environmental and operational support
to maintain the systems proper operation. Eucalyptus provides
functionality aimed at enabling highly available deployments:
- Detection of hardware and network faults which impact system
availability: Availability of the system is determined
by its ability to properly service a user request at a given
time. The system is available when there is at least a set of
functioning services to perform the operations which result from
a user request (i.e., system is distributed and operations
require orchestration involving some, possibly all, services in
the system).
- Deployment of redundant services to accommodate host
failure: A failure is the observed consequence of an
underlying fault which compromises the systems function in some
way (possibly compromising availability).
- Automated recovery from individual component failure:
Eucalyptus can take advantage of redundant host and network
resources to accommodate singular failures while preserving the
system's overall availability. As a result, the deployment of
the system plays a large role in the level of availability that
can be achieved.
To deliver services with high availability, Eucalyptus depends upon
redundant hardware and network.
Considerations
A highly available deployment is able to mitigate the impact on
system availability of faults from the following sources:
- Machines hosting Eucalyptus services: Hardware faults
on machines hosting Eucalyptus services can result in
component services being unavailable for use by the system
or users. The state of the hosting machine is monitored by
the system and determines whether it can contribute to work
done. In support of high availability, you can configure
redundant component services. With redundant component
services, Eucalyptus can isolate and mask the a component's
failure.
- Inter-component networks: Faults in the networks that
connect the system's components to each other can prevent
access to cloud resources and restrict the system's ability
to process user requests. First, internal resources may
become unavailable. For example, a single network outage
could impact access to attached volumes or prevent access to
running instances. Second, the coordination of services
needed to process user requests may be impeded even if the
service state is otherwise healthy.
- User-facing network connections: User-facing network
faults can prevent access to an otherwise properly
functioning system. The ability of a user to access the
system is difficult to determine from the perspective of the
system - can't look through the users eyes. Allowing for
multiple inbound paths (for example, multiple disjoint
routes) decreases the possibility of an
availability-impacting outage occurring w/in the scope of
the environment within which Eucalyptus is deployed. (See
also: registering arbitrators)
Recommendations
To ensure availability in the face of any single failure, we
recommend the following deployment strategy:
- Host/Service Redundancy: Each component which is
registered should have a complementary service registered on
a redundant host. For example, the cloud and walrus services
should be installed and registered on two hosts.
Additionally, for example, each partition should have two
cluster controllers and storage controllers (and VMware
Brokers, if VMware is being used) configured. Each such
complementary pair of services can suffer a single outage
before system availability is compromised.
- Inter-component Network Redundancy: Each host of a
component service should have redundant and disjoint network
connections to other internal component services and
supporting systems (for example, SANs, vSphere). The
recommended approach is to have two ethernet devices (each
connected to a disjoint layer-2 network) on each host and
bonding the devices. Such a configuration is also suggested
on node controllers. Then, the outage of a either layer-2
network or ethernet device on a host does not impact service
availability or access to cloud resources.
- User-facing Network Redundancy: The wide area (where
users are) network connection should be redundant and
disjoint. Each such path should have an independent
arbitrator host whose liveness (as determined by ICMP echo)
is used to approximate the users' ability to access the
system. Redundant network connections from the local area
network to the wide area network and user reachability
approximation (arbitrator)
- System Reachability Approximation: The wide area
(where users are) network connection(s) path should have an
independent host (arbitrator) whose liveness (as determined
by ICMP echo) can serve as a reasonable approximation of
users' ability to access the system. Ideally, the host
“closest” to the user, but still within the domain of the
deployment environment should be used (for example, the
border gateway of the hosting AS network). With such an
arbitrator host in the network path between the user and the
system, a failure by the user to reach an otherwise working
service and allow the system to enable the complementary
service (which should have a separate network route)
restoring user access.
SAN and Multipathing
Multipathing is a way to make the data path from the NC or SC to
your SAN device highly available. Mulipathing does this by
giving the host two network paths that both lead to the same
data volume. This allows the host to switch from one network
path to the other, in the event that one path becomes
unavailable. Essentially, multipathing decreases the likelihood
that a volume will become unreachable from a host (NC). For
information about configuring your SAN for multipathing, see
Configure the Storage Controller.