
Recovering from a Failure: Walrus

In these examples, we will assume that Walrus WS00 is the primary and WS01 is the secondary Walrus server.

Software Failure Example

In this scenario, WS01 refuses to enter the DISABLED state. DRBD reports that it is in split-brain mode, and "drbdadm cstate r0" shows that DRBD is in the WFConnection state.
If you are sure that the data on WS01 is out of date and can be discarded, perform the following steps to restore HA mode.
  1. Shut down the eucalyptus-cloud process on WS01.
  2. Ensure that the DRBD connection is down by running "drbdadm disconnect r0" on either of the two Walrus hosts.
  3. On the primary Walrus, WS00, set DRBD as the primary by running "drbdadm primary r0".
  4. On the secondary Walrus, WS01, execute the following command:
    drbdadm -- --discard-my-data connect r0
    WARNING
    This command will discard all data on WS01 and synchronize data from WS00.
  5. Monitor the state of DRBD by running:
    watch -n 2 cat /proc/drbd
  6. When the data on WS01 is synced, start the eucalyptus-cloud process on WS01.
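
The wait in steps 5 and 6 can also be scripted instead of watched by hand. The following is a minimal sketch, not part of Eucalyptus or DRBD: the function name drbd_in_sync is hypothetical, and it assumes the DRBD 8.x /proc/drbd layout, in which the status line of a resource carries a cs: (connection state) field followed later by a ds: (local/peer disk state) field. It also assumes a single resource (r0, as in this example); with several resources it would succeed as soon as any one of them is in sync.

```shell
#!/bin/sh
# Hypothetical helper: reads DRBD status text on stdin and succeeds when
# a resource is connected with both the local and the peer disk up to date.
# Assumes the DRBD 8.x /proc/drbd layout (cs:... then ds:local/peer).
drbd_in_sync() {
    grep -q 'cs:Connected .*ds:UpToDate/UpToDate'
}

# Example use on a Walrus host, polling every 2 seconds as in step 5:
#   until drbd_in_sync < /proc/drbd; do sleep 2; done
```

While the resource is still in WFConnection or is resynchronizing, the pattern does not match, so the loop keeps waiting. Checking /proc/drbd by eye before starting eucalyptus-cloud is still advisable.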

Hardware Failure Example

In this example, the primary WS00 needs to be taken out of service due to a hardware failure, such as a failed disk.
  1. Shut down the eucalyptus-cloud process on WS00 if it is still running.
  2. Monitor service status by running euca-describe-services on WS01 and ensure that WS01 has taken over as the new primary (state: ENABLED).
  3. Shut down the host running WS00.
  4. If the host running WS00 is to be replaced entirely or the OS reinstalled:
    • On the primary CLC, enter the following to deregister WS00:
      euca_conf --deregister-walrus --component WS00 --partition <name of partition> --host <WS00 host>
    • After Linux has been installed on the new WS00 host and the host is ready for use, reinstall the "eucalyptus-walrus" package.
    • Synchronize the DRBD configuration (/etc/drbd.conf and /etc/eucalyptus/drbd*) from the WS01 host.
    • On WS00, re-configure DRBD by following the Configure DRBD section of the Installation Guide and performing the steps that are relevant to the secondary Walrus server (WS00 is the new secondary Walrus server, in this example).
    • Re-register WS00 with a new host name if necessary. This will synchronize keys.
  5. On WS00, execute the following command:
    drbdadm -- --discard-my-data connect r0
    WARNING
    This command will discard all data on WS00 and synchronize data from WS01.
  6. Monitor the state of DRBD by entering:
    watch -n 2 cat /proc/drbd
    WS01 should be marked as the primary and WS00 as the new secondary. Wait until the data is synchronized.
  7. When the data on WS00 is synced from WS01, start the eucalyptus-cloud process on WS00.
  8. Monitor service status by running "euca-describe-services" on the primary CLC and ensure that WS00 is DISABLED and WS01 is ENABLED.
At this point, the Walrus service is back in HA mode.
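
The check in step 8 can be expressed as a small filter over the service report. This is only a sketch under stated assumptions: the function name service_in_state is hypothetical, and it assumes that euca-describe-services prints one line per service in which the component name and its state appear as separate whitespace-delimited fields. Verify this against the output of your installed version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: reads a service report on stdin and succeeds when
# some line contains both the component name ($1) and the state ($2)
# as whole whitespace-separated fields.
service_in_state() {
    awk -v name="$1" -v state="$2" '
        {
            has_name = 0; has_state = 0
            for (i = 1; i <= NF; i++) {
                if ($i == name)  has_name = 1
                if ($i == state) has_state = 1
            }
            if (has_name && has_state) found = 1
        }
        END { exit(found ? 0 : 1) }'
}

# Example use on the primary CLC:
#   euca-describe-services | service_in_state WS01 ENABLED
#   euca-describe-services | service_in_state WS00 DISABLED
```

Matching whole fields rather than substrings avoids treating a state such as NOTREADY or a longer component name as a match by accident.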