Eucalyptus and Nagios

Last Updated: April 15, 2013

Production deployments of Eucalyptus, like production deployments of any infrastructure software running in a data center, require a health and status monitoring system. Such a system lets the Eucalyptus or data-center administrator stay on top of evolving resource situations, and it provides invaluable diagnostic information when errors and other faults occur within the resource pool (servers, networks, storage, etc.). One such system, which we have used in production and recommend to users who do not already have a monitoring system in place, is Nagios.

Nagios is a freely available, open-source IT resource monitoring system. From the Nagios website:

"Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes."

Installing Nagios

In this document, we step through the process of installing a basic Nagios monitoring deployment on a set of Eucalyptus systems running CentOS or RHEL 6.

Step 1: Install Nagios

Having installed Eucalyptus from packages, we have already added the EPEL package repository, which contains the Nagios packages. On every server running a Eucalyptus component, run the following to install the Nagios Remote Plugin Executor (NRPE), which acts as the remote check agent, and the service check plugins:

# yum install nrpe nagios-plugins-all nagios-plugins-nrpe

Then, on a server that we refer to as the 'Nagios Server', install the Nagios package:

# yum install nagios

Nagios is now installed.

Step 2: Configure Nagios for Basic System Monitoring

A few steps are required to get basic system monitoring going with Nagios. Note that in a distributed monitoring installation, all host-specific settings are confined to the single Nagios server. The configuration on the remote hosts is identical, so one file can be checked once and pushed out without maintaining a separate config for each host.

On all hosts, install a configuration file that allows the Nagios server to interact with the NRPE daemon. Prepare and deploy the file as follows:

  • edit /etc/nagios/nrpe.cfg
  • change 'allowed_hosts=127.0.0.1' to 'allowed_hosts=' followed by the IP address of your Nagios server
  • change the checks at the end of the file to be a little more in line with the built-in local check definitions:
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_procs]=/usr/lib64/nagios/plugins/check_procs -w 250 -c 400 -s RSZDT
command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w 20% -c 10%
  • push the file out to /etc/nagios/nrpe.cfg on all hosts
  • enable the NRPE service to run on system boot (a one-time operation) and start the NRPE daemon by running the following on all hosts:
chkconfig --level 2345 nrpe on
service nrpe start
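
The per-host steps above lend themselves to a small script. The following is a minimal sketch rather than part of the procedure itself: the function names set_allowed_hosts and push_nrpe_cfg, the RUN variable, the use of root SSH access, and the example addresses are all assumptions made for illustration. Setting RUN=echo prints the commands instead of running them, which is a cheap way to review what would be done.

```shell
#!/bin/sh
# Sketch only: function names, RUN, and root SSH access are assumptions.

# Rewrite the allowed_hosts line of an nrpe.cfg so it names the Nagios server.
# usage: set_allowed_hosts FILE NAGIOS_SERVER_IP
set_allowed_hosts() {
  file=$1
  ip=$2
  tmp="$file.tmp.$$"
  sed "s/^allowed_hosts=.*/allowed_hosts=$ip/" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Copy the edited file to each host, then enable and start NRPE there.
# With RUN=echo the commands are printed instead of executed.
# usage: push_nrpe_cfg FILE HOST...
push_nrpe_cfg() {
  file=$1
  shift
  for host in "$@"; do
    ${RUN:-} scp "$file" "root@$host:/etc/nagios/nrpe.cfg"
    ${RUN:-} ssh "root@$host" "chkconfig --level 2345 nrpe on && service nrpe start"
  done
}

# Example (hypothetical addresses):
#   set_allowed_hosts ./nrpe.cfg 10.102.1.10
#   RUN=echo; push_nrpe_cfg ./nrpe.cfg 10.102.1.24 10.102.1.25
```

Editing a working copy and pushing it unchanged to every host keeps the remote configuration identical everywhere, which is the property this setup relies on.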

Next, on the Nagios server, we modify the configuration to allow the use of NRPE, and to read remote host config files from a local directory where we'll store each Eucalyptus host's unique configuration.

  • edit /etc/nagios/nagios.cfg
  • uncomment the line 'cfg_dir=/etc/nagios/servers' and save the file
  • create the /etc/nagios/servers directory
  • edit /etc/nagios/objects/commands.cfg, add the following to the end of the file, and save:
define command{
  command_name check_nrpe
  command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
  • set the password for the 'nagiosadmin' web user by running 'htpasswd -bc /etc/nagios/passwd nagiosadmin nagios' (this example sets the password to 'nagios'; substitute a password of your own)
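
The server-side edits above can also be scripted so that they are safe to repeat. This is a minimal sketch under stated assumptions: enable_servers_dir and add_check_nrpe_command are hypothetical helper names, the file paths are passed as arguments rather than hard-coded, and the stock nagios.cfg is assumed to contain the commented-out cfg_dir line exactly as described above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; paths are passed as arguments.

# Uncomment the cfg_dir line for the servers directory and create the directory.
# usage: enable_servers_dir NAGIOS_CFG SERVERS_DIR
enable_servers_dir() {
  cfg=$1
  dir=$2
  tmp="$cfg.tmp.$$"
  sed "s|^#[[:space:]]*cfg_dir=$dir\$|cfg_dir=$dir|" "$cfg" > "$tmp" && mv "$tmp" "$cfg"
  mkdir -p "$dir"
}

# Append the check_nrpe command definition unless one is already present,
# so running the function twice does not create a duplicate definition.
# usage: add_check_nrpe_command COMMANDS_CFG
add_check_nrpe_command() {
  cmds=$1
  if grep -q 'command_name[[:space:]]*check_nrpe' "$cmds" 2>/dev/null; then
    return 0
  fi
  {
    printf '\n'
    printf 'define command{\n'
    printf '  command_name check_nrpe\n'
    printf '  command_line /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$\n'
    printf '}\n'
  } >> "$cmds"
}

# Example:
#   enable_servers_dir /etc/nagios/nagios.cfg /etc/nagios/servers
#   add_check_nrpe_command /etc/nagios/objects/commands.cfg
```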

Next, on the Nagios server, we set up one configuration file per Eucalyptus host. Each file defines both the host end-point itself and the checks (services) to run on that host. Each server configuration file should be placed in /etc/nagios/servers and must end with a '.cfg' suffix. An example of such a server config file follows. The site-specific modifications are the host's name, alias, and IP address in the 'host' section; the 'host_name' of every service must match the name given there.

###############################################################################
###############################################################################
#
# HOST DEFINITION
#
###############################################################################
###############################################################################

# Define the host to be monitored

define host{
 use linux-server
 host_name my-cloud-controller
 alias Cloud Controller
 address 10.102.1.24
 check_interval 1
 }
###############################################################################
###############################################################################
#
# SERVICE DEFINITIONS
#
###############################################################################
###############################################################################

# Define a service to "ping" the monitored host

define service{
 use local-service ; Name of service template to use
 host_name my-cloud-controller
 service_description PING
 check_command check_ping!100.0,20%!500.0,60%
 }

# Define a service to check the disk space of the root partition
# on the monitored host. Warning if < 20% free, critical if
# < 10% free space on partition.

define service{
 use generic-service ; Name of service template to use
 host_name my-cloud-controller
 service_description Root Partition
 check_command check_nrpe!check_disk
 }

# Define a service to check the number of currently logged in
# users on the monitored host. Warning if > 5 users, critical
# if > 10 users (thresholds are set by check_users in nrpe.cfg).

define service{
 use generic-service ; Name of service template to use
 host_name my-cloud-controller
 service_description Current Users
 check_command check_nrpe!check_users
 }

# Define a service to check the number of currently running procs
# on the monitored host. Warning if > 250 processes, critical if
# > 400 processes.

define service{
 use generic-service ; Name of service template to use
 host_name my-cloud-controller
 service_description Total Processes
 check_command check_nrpe!check_procs
 }

# Define a service to check the load on the monitored host.

define service{
 use generic-service ; Name of service template to use
 host_name my-cloud-controller
 service_description Current Load
 check_command check_nrpe!check_load
 }

# Define a service to check the swap usage of the monitored host.
# Critical if less than 10% of swap is free, warning if less than 20% is free

define service{
 use generic-service ; Name of service template to use
 host_name my-cloud-controller
 service_description Swap Usage
 check_command check_nrpe!check_swap
 }

# Define a service to check SSH on the monitored host.
# Disable notifications for this service by default, as not all users may have SSH enabled.

define service{
 use generic-service ; Name of service template to use
 host_name my-cloud-controller
 service_description SSH
 check_command check_ssh
 notifications_enabled 0
 }
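
Because every host needs a nearly identical file, the per-host configs can be generated instead of copied and edited by hand. The sketch below is one possible approach, not part of the original procedure: make_host_cfg is a hypothetical helper, the example address is invented, and for brevity it emits only the host definition and the five NRPE-backed services, using the check name as the service description. The PING and SSH services from the example above would still need to be added.

```shell
#!/bin/sh
# Sketch only: make_host_cfg is a hypothetical helper, not a Nagios tool.

# Print a host definition followed by one service per basic NRPE check.
# usage: make_host_cfg HOST_NAME ALIAS ADDRESS
make_host_cfg() {
  name=$1
  alias_text=$2
  address=$3
  printf 'define host{\n use linux-server\n host_name %s\n alias %s\n address %s\n check_interval 1\n }\n' \
    "$name" "$alias_text" "$address"
  for check in check_disk check_users check_procs check_load check_swap; do
    printf '\ndefine service{\n use generic-service\n host_name %s\n service_description %s\n check_command check_nrpe!%s\n }\n' \
      "$name" "$check" "$check"
  done
}

# Example (hypothetical host and address):
#   make_host_cfg my-cluster-controller "Cluster Controller" 10.102.1.25 \
#     > /etc/nagios/servers/my-cluster-controller.cfg
```

Generating the file from the host name guarantees that each service's host_name matches the host definition, which is the easiest mistake to make when editing copies by hand.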

Finally, when all of your hosts have such a file in place, enable the httpd and nagios services to run on boot (a one-time operation), check the validity of your configuration changes, and then start Nagios on the Nagios server with:

chkconfig --level 2345 httpd on
chkconfig --level 2345 nagios on
nagios -v /etc/nagios/nagios.cfg
service httpd start
service nagios start

Nagios should now be up and monitoring your environment with the basic checks that we've enabled for each host. To use the Nagios UI, point a browser at your Nagios server (http://your.nagios.server.ip/nagios) and log in as user 'nagiosadmin' with the password you set above (in this example, 'nagios'). To verify basic functionality, navigate to the 'hosts' and 'services' displays, which show the status of all defined hosts and services. Polling takes a few minutes to get started, but services should move from 'PENDING' to 'OK' (or 'WARNING' or 'CRITICAL') within about five minutes.
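
When configs are changed later, it helps to make the validation step a hard gate so that a broken configuration never reaches a restart. The following is a minimal sketch, assuming that 'nagios -v' exits with a non-zero status when it finds errors; start_if_valid is a hypothetical wrapper name.

```shell
#!/bin/sh
# Sketch only: start_if_valid is a hypothetical wrapper.

# Run a validation command; run the start command only if validation succeeds.
# On failure, the validator's output is shown on stderr and nothing is started.
# usage: start_if_valid "VALIDATION COMMAND" "START COMMAND"
start_if_valid() {
  if output=$(sh -c "$1" 2>&1); then
    sh -c "$2"
  else
    printf '%s\n' "$output" >&2
    echo "validation failed; not starting" >&2
    return 1
  fi
}

# Intended use on the Nagios server:
#   start_if_valid "nagios -v /etc/nagios/nagios.cfg" "service nagios restart"
```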

Step 3: Configure Nagios for Eucalyptus

At this point, we have a monitoring tool that is simple to set up and usable for managing and maintaining a Eucalyptus deployment. Knowing whether networks are up or down, disks are free or full, and load is low or high is, in most cases, necessary when approaching any Eucalyptus deployment problem. Next, we add a few Eucalyptus-specific checks to the installation, using the log file checker (check_log) that comes with the Nagios plugins as a basic Eucalyptus service health and status monitor.

  • edit /etc/nagios/nrpe.cfg
  • add the following check definitions
# Eucalyptus checks
command[check_cclog]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/cc.log -O /tmp/nagioscc.log -q "ERROR|FATAL"
command[check_ccfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/cc-fault.log -O /dev/null -q "ERR-"
command[check_nclog]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/nc.log -O /tmp/nagiosnc.log -q "ERROR|FATAL"
command[check_ncfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/nc-fault.log -O /dev/null -q "ERR-"
command[check_cloudlog]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/cloud-output.log -O /tmp/nagioscloud.log -q "ERROR|FATAL"
command[check_cloudfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/cloud-fault.log -O /dev/null -q "ERR-"
command[check_walrusfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/walrus-fault.log -O /dev/null -q "ERR-"
command[check_scfaults]=/usr/lib64/nagios/plugins/check_log -F /var/log/eucalyptus/sc-fault.log -O /dev/null -q "ERR-"
  • save the file, and push it out to /etc/nagios/nrpe.cfg on all Eucalyptus hosts
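
The log checks above all follow one idea: compare the current log (-F) with a saved copy (-O) and look for a pattern (-q) in the lines that are new. The sketch below illustrates that idea only. It is not the check_log plugin's code, count_new_matches is a hypothetical name, and it uses 'grep -E' so that "ERROR|FATAL" is treated as an alternation; how the installed plugin interprets the pattern may differ by version and is worth verifying on your systems.

```shell
#!/bin/sh
# Sketch only: an illustration of the "diff against a saved copy" idea,
# not the check_log plugin itself.

# Print the number of lines added to LOG since the last run that match PATTERN,
# then save the current LOG as the new baseline.
# usage: count_new_matches LOG SAVED_COPY PATTERN
count_new_matches() {
  log=$1
  old=$2
  pattern=$3
  # First run: start from an empty baseline.
  [ -f "$old" ] || : > "$old"
  # Lines present in the current log but not in the saved copy.
  new_lines=$(diff "$old" "$log" | sed -n 's/^> //p')
  cp "$log" "$old"
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
  printf '%s\n' "$new_lines" | grep -E -c -- "$pattern" || true
}

# Example (hypothetical baseline path):
#   count_new_matches /var/log/eucalyptus/cc.log /tmp/cc.baseline 'ERROR|FATAL'
```

In this sketch, passing /dev/null as the saved copy, as the fault-log checks do, means no baseline is ever retained, so every run scans the whole file again.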

Next, on the Nagios server, add service definitions to the appropriate server configs in /etc/nagios/servers. For example, add the cloud and Walrus checkers to the machine running the Cloud Controller and/or Walrus, add the cluster controller (CC) checkers to the machine running the Cluster Controller, and so on. Choose the checkers you need from those listed here. When adding a service to a server configuration, make sure that its 'host_name' field matches the host_name defined in the 'host' section of that server's configuration.

## Cloud Controller Checkers

define service{
 use generic-service ; Name of service template to use
 host_name my-cloud-controller
 service_description Cloud Logs
 check_command check_nrpe!check_cloudlog
 }

define service{
 use generic-service ; Name of service template to use
 host_name my-cloud-controller
 service_description Cloud Faults
 check_command check_nrpe!check_cloudfaults
 }

## Walrus Checkers

define service{
 use generic-service ; Name of service template to use
 host_name my-walrus
 service_description Walrus Logs
 check_command check_nrpe!check_cloudlog
 }

define service{
 use generic-service ; Name of service template to use
 host_name my-walrus
 service_description Walrus Faults
 check_command check_nrpe!check_walrusfaults
 }

## Storage Controller Checkers

define service{
 use generic-service ; Name of service template to use
 host_name my-storage-controller
 service_description SC Logs
 check_command check_nrpe!check_cloudlog
 }

define service{
 use generic-service ; Name of service template to use
 host_name my-storage-controller
 service_description SC Faults
 check_command check_nrpe!check_scfaults
 }

## Cluster Controller Checkers

define service{
 use generic-service ; Name of service template to use
 host_name my-cluster-controller
 service_description Cluster Controller Logs
 check_command check_nrpe!check_cclog
 }

define service{
 use generic-service ; Name of service template to use
 host_name my-cluster-controller
 service_description Cluster Controller Faults
 check_command check_nrpe!check_ccfaults
 }

## Node Controller Checkers

define service{
 use generic-service ; Name of service template to use
 host_name my-node-controller
 service_description Node Controller Logs
 check_command check_nrpe!check_nclog
 }

define service{
 use generic-service ; Name of service template to use
 host_name my-node-controller
 service_description Node Controller Faults
 check_command check_nrpe!check_ncfaults
 }
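
The mapping from component to checkers shown above can be captured in a small helper that emits the right service definitions for a host. This is a sketch under stated assumptions: checks_for_role and role_services are hypothetical names, and the role labels (clc, walrus, sc, cc, nc) are shorthand invented here for the five components. The check names themselves are the ones defined in nrpe.cfg above.

```shell
#!/bin/sh
# Sketch only: helper names and role labels are hypothetical.

# Print the NRPE check names that apply to a Eucalyptus component role.
# usage: checks_for_role ROLE
checks_for_role() {
  case $1 in
    clc)    echo "check_cloudlog check_cloudfaults" ;;
    walrus) echo "check_cloudlog check_walrusfaults" ;;
    sc)     echo "check_cloudlog check_scfaults" ;;
    cc)     echo "check_cclog check_ccfaults" ;;
    nc)     echo "check_nclog check_ncfaults" ;;
    *)      echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Print one service definition per check for the given host and role.
# usage: role_services HOST_NAME ROLE
role_services() {
  checks=$(checks_for_role "$2") || return 1
  for check in $checks; do
    printf 'define service{\n use generic-service\n host_name %s\n service_description %s %s\n check_command check_nrpe!%s\n }\n\n' \
      "$1" "$2" "$check" "$check"
  done
}

# Example: append the node controller checks to an existing host config.
#   role_services my-node-controller nc >> /etc/nagios/servers/my-node-controller.cfg
```

Prefixing the service description with the role keeps descriptions unique when one machine runs several components that share the check_cloudlog check.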

Finally, restart Nagios on the Nagios server and the NRPE daemon on all hosts, then check the UI.

Step 4: Use Nagios

Here is an example screen shot of the resulting services UI, where we've induced an ERROR condition on the Cloud Controller log file checker by sending the service invalid requests.

Next Steps

To receive notifications when checkers fail, Nagios can be configured to send email notifications with a configurable frequency and destination address. Parameters such as how often services are checked, how many times a check must fail before a service is considered 'down', and many others can also be configured. Please refer to the Nagios documentation for information on how to fine-tune a Nagios installation. We also maintain a live GitHub project that contains additional Nagios checkers and tests for Eucalyptus.
