Video recording and production done by OpenStack Foundation.
Compute node high availability means that when hardware or the network fails, or the host operating system crashes, the node is fenced and shut down, and the instances on it are relocated and rebooted on other compute nodes. A vanilla OpenStack deployment instead expects tenant workloads to provide their own fault tolerance and failover. While this assumption holds for modern clustered applications, many IT solutions in traditional industries still rely on compute node high availability. This is a barrier to OpenStack adoption in traditional enterprise IT.
In a small deployment, it is tempting to set up a monitoring service for the compute nodes and call Nova host-evacuate when a compute node fails. However, the monitoring service itself becomes a single point of failure and a hot spot. There are also proposals, and possibly implementations, that use ZooKeeper or Pacemaker (with Pacemaker-remote), because they provide heartbeat and membership services. The basic idea is that each compute node registers as an ephemeral znode and ZooKeeper maintains the heartbeat. Running Pacemaker-remote on the compute nodes achieves a similar effect.
The problem is that the heartbeat usually runs on the OpenStack management network. If a host has lost management network connectivity but still has good storage network connectivity, we should not consider it failed, and we should not fence and evacuate it. Losing management connectivity only means that we cannot boot new instances on that host. It does not affect the running instances, so evacuation would cause unnecessary downtime for the tenant workloads. On the other hand, if the management network is healthy for a host but its storage network fails, we should fence and evacuate the host. ZooKeeper and Pacemaker-remote style solutions also suffer from a scalability problem, because the heartbeats run between a few ZooKeeper/Pacemaker server nodes and many compute nodes.
Hence we propose a distributed health checking mechanism for compute nodes. It can handle compute node power failure, host OS crashes, failing memory, disk failure, interruption of the management, storage, or tunnel network, and so forth.
We use a Gossip protocol for distributed heartbeat checking. The Gossip implementation comes from the Consul project (consul.io). The main idea is to run an agent on every compute node and have the agents probe each other. The agent on a compute node can also check and monitor many other things, such as OpenStack services and hardware status.
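To illustrate why peer probing spreads the heartbeat load, here is a minimal simulation sketch in Python. It is not Consul's actual Gossip implementation. The function names, the `fanout` parameter, and the idea of counting failed probes per target are all assumptions made for illustration. Each live agent probes a fixed number of random peers per round, so the work per node stays constant as the cluster grows.

```python
import random


def pick_probe_targets(self_name, members, fanout, rng):
    """Pick up to `fanout` random peers (hypothetical helper, never self)."""
    peers = [m for m in members if m != self_name]
    return rng.sample(peers, min(fanout, len(peers)))


def run_probe_round(members, reachable, fanout=3, seed=0):
    """Simulate one probe round and count failed probes per target.

    `reachable` is the set of nodes that answer probes on this network.
    An unreachable node also sends no probes of its own.
    """
    rng = random.Random(seed)
    failed_probes = {m: 0 for m in members}
    for agent in members:
        if agent not in reachable:
            continue  # a dead or isolated agent cannot probe anyone
        for target in pick_probe_targets(agent, members, fanout, rng):
            if target not in reachable:
                failed_probes[target] += 1
    return failed_probes
```

In this sketch, a node that accumulates failed probes from several independent peers is a candidate for being reported as unreachable on that network, while healthy nodes never accumulate any.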
We run distributed heartbeat checking on all the OpenStack logical networks, usually the management, storage, and tunnel networks, and report the network connectivity and other monitored status to the controller node. The controller node then decides, based on all this information, whether to fence and evacuate the node. We present and discuss an example decision matrix. It is also possible to create a Ceilometer plugin that reports the data and events gathered by the distributed health checking, so that an admin can register alarms, add handlers, and decide what to do in a highly flexible way.
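A decision matrix of this kind could be sketched as follows in Python. This is not the matrix presented in the talk. The two rules taken from the text above are that losing only the management network does not justify evacuation, and that losing the storage network does. The `Action` names, the `ALERT` outcome, and the treatment of tunnel network failure are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    ALERT = "alert"  # assumed: notify the admin, leave instances running
    FENCE_AND_EVACUATE = "fence_and_evacuate"


# Keys are (management_ok, storage_ok, tunnel_ok) as seen by the controller.
DECISION_MATRIX = {
    (True, True, True): Action.NONE,
    (False, True, True): Action.ALERT,  # instances keep running
    (True, True, False): Action.ALERT,  # assumption: tunnel loss only alerts
    (False, True, False): Action.ALERT,  # assumption
    (True, False, True): Action.FENCE_AND_EVACUATE,
    (False, False, True): Action.FENCE_AND_EVACUATE,
    (True, False, False): Action.FENCE_AND_EVACUATE,
    (False, False, False): Action.FENCE_AND_EVACUATE,
}


def decide(management_ok, storage_ok, tunnel_ok):
    """Look up the controller's action for one compute node."""
    return DECISION_MATRIX[(management_ok, storage_ok, tunnel_ok)]
```

Keeping the policy in a table rather than in nested conditionals makes it easy for an operator to review every combination and to change a single cell, for example to treat tunnel network loss more strictly.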
We also propose a fencing mechanism based on a custom Gossip query to complement remote power-off via IPMI. When a host appears to have lost power, it is not possible to distinguish an IPMI network failure from an actual power failure, because the symptoms are the same. If we cannot confirm the power state of the failed host via IPMI, we send a custom Gossip query to the target node. Upon receiving the query, the target node sends an ack and stops feeding the hardware watchdog, so that the watchdog shuts down the host. In addition, if a node stops receiving Gossip heartbeats from all the other nodes, it should fence itself and shut down the host. In this way, either the fence request reaches the target host, or the host fences itself because of the network connectivity failure, or the host has actually lost power. From the controller's perspective, after a reasonable time it can be sure that the failed host has been powered off, so it is safe to perform host evacuation.
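The controller side of this fencing sequence can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the actual implementation. The callables `ipmi_power_off` and `send_fence_query` are hypothetical stand-ins for the real IPMI and Gossip calls, each returning True on confirmed success. The timeout values are placeholders.

```python
def fence_host(ipmi_power_off, send_fence_query, sleep,
               watchdog_timeout=60.0, self_fence_timeout=120.0):
    """Return how the controller concluded that the host is powered off.

    All parameters are hypothetical: two callables returning True on
    success, a sleep function, and two timeouts in seconds.
    """
    # Step 1: try to power the host off through IPMI.
    if ipmi_power_off():
        return "ipmi"

    # Step 2: IPMI failed, so ask the node to stop feeding its watchdog.
    if send_fence_query():
        # The node acked; wait for the hardware watchdog to expire.
        sleep(watchdog_timeout)
        return "gossip-query"

    # Step 3: no ack. Either the node is isolated and will fence itself
    # after missing heartbeats, or it has really lost power. Wait long
    # enough for self-fencing plus the watchdog expiry.
    sleep(self_fence_timeout + watchdog_timeout)
    return "self-fence-or-power-loss"
```

Only after `fence_host` returns would the controller go on to evacuate the host. Passing `sleep` as a parameter keeps the sketch testable. In real use it could be `time.sleep`.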