Failover cluster

A failover cluster increases the availability of your monitoring installation by protecting a device or individual components against hardware failures. Clustering is not a replacement for data backup.

The cluster ensures a shorter downtime in the following situations:

  • If the RAID in a Checkmk rack1 or the SD card in a Checkmk rail2 is no longer accessible, the inactive node takes control of the resources.
  • If the device that has been active until now can no longer be accessed (has failed), the inactive node takes control of the resources.
  • If the active device can no longer access the “external” network while the inactive node still has a connection to this network, the inactive node takes control of the resources.
  • If you carry out a firmware update, you can update the nodes individually. While one node is being updated, the other node will continue performing the monitoring.
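The failover situations listed above can be summarised as a small decision function. This is only an illustrative sketch in Python, not the appliance's actual cluster logic. The class NodeHealth, its fields and the function name are hypothetical, and the rule that an unhealthy inactive node never takes over is an assumption that goes beyond the text.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Health facts about one cluster node (hypothetical model)."""
    reachable: bool             # the node can still be accessed
    storage_ok: bool            # RAID (rack1) or SD card (rail2) is accessible
    external_network_ok: bool   # the node can reach the "external" network


def standby_should_take_over(active: NodeHealth, standby: NodeHealth) -> bool:
    """Return True if the inactive node should take control of the resources."""
    # Assumption: a standby node that is itself unhealthy cannot take over.
    if not standby.reachable or not standby.storage_ok:
        return False
    # The active device has failed or can no longer be accessed.
    if not active.reachable:
        return True
    # The storage of the active device is no longer accessible.
    if not active.storage_ok:
        return True
    # Only the inactive node still has a connection to the external network.
    if not active.external_network_ok and standby.external_network_ok:
        return True
    return False
```

Note that a lost external connection on the active node only triggers a takeover if the inactive node is better connected. If both nodes have lost the external network, switching would bring no benefit.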

1. Prerequisites

In order to build a cluster, you first need two compatible Checkmk appliances. The following models can be clustered with one another:

  • 2x Checkmk rack1
  • 2x Checkmk rail2
  • 2x Checkmk virt1
  • 1x Checkmk rack1 and 1x Checkmk virt1

In addition, both devices must run compatible firmware, version 1.1.0 or later.
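The prerequisites above can be expressed as a simple check. The following Python sketch is for illustration only. The function names are hypothetical, firmware versions are assumed to be plain dotted numbers such as 1.1.0, and the check covers only the minimum version, since the text does not define what makes two firmware versions compatible with each other.

```python
# Model combinations that can be clustered, as listed in the prerequisites.
SUPPORTED_PAIRS = {
    frozenset(["rack1"]),           # 2x Checkmk rack1
    frozenset(["rail2"]),           # 2x Checkmk rail2
    frozenset(["virt1"]),           # 2x Checkmk virt1
    frozenset(["rack1", "virt1"]),  # 1x rack1 and 1x virt1
}
MIN_FIRMWARE = (1, 1, 0)


def parse_version(text):
    """Turn a dotted version such as '1.1.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def can_cluster(model_a, firmware_a, model_b, firmware_b):
    """Check the model combination and the minimum firmware of both devices."""
    pair_ok = frozenset([model_a, model_b]) in SUPPORTED_PAIRS
    versions_ok = all(
        parse_version(version) >= MIN_FIRMWARE
        for version in (firmware_a, firmware_b)
    )
    return pair_ok and versions_ok
```

Comparing versions as tuples of integers avoids the pitfall of string comparison, where "1.10.0" would sort before "1.9.0".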

The devices must be wired with at least two mutually independent network connections. It is recommended to use as direct a connection as possible between the devices and to make a further connection over your LAN.

To increase the availability of the network connections, you should create a bonding configuration that uses all four network connectors of the Checkmk rack1, rather than two connections via individual network connectors. Use the interfaces LAN1 and LAN2 for the connection to your network and the interfaces LAN3 and LAN4 for the direct connection between the devices.

2. Migration of existing installations

Devices that were delivered and initialised with the firmware version 1.1.0 or higher can be clustered without migration.

Devices initialised with earlier firmware must first be updated to version 1.1.0 or higher. The factory settings of the device must then be restored, which prepares the device for clustering. Please note that, in order to prevent data loss, you must back up your data from the device before restoring the factory settings and restore the backup afterwards.

3. Configuration of the cluster

This guide assumes that you have already pre-configured both devices to the extent where the web interface can be opened with a web browser.

Before actually setting up the cluster, you must first prepare both devices. This mainly involves adapting the network configuration to fulfil clustering requirements (see prerequisites).

The following describes the configuration of a cluster with two Checkmk rack1 devices. The resulting cluster looks as shown in the diagram below.

The interface designations LAN1, LAN2 etc. used in the diagram correspond to the designations of the physical interfaces on the device. In the operating system, LAN1 corresponds to the device eth0, LAN2 to the device eth1 etc.
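The naming rule can be written as a short helper. This Python sketch is only an illustration of the rule described above, and the function name is hypothetical.

```python
def lan_label_to_device(label):
    """Map a physical interface label such as 'LAN1' to its device name."""
    prefix = "LAN"
    number_text = label[len(prefix):]
    if not label.startswith(prefix) or not number_text.isdigit():
        raise ValueError(f"not a LAN interface label: {label!r}")
    number = int(number_text)
    if number < 1:
        raise ValueError("LAN interface labels start at LAN1")
    # LAN labels count from 1, the eth devices count from 0.
    return f"eth{number - 1}"
```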

This configuration complies with the recommendations for the clustering of two Checkmk rack1. You can of course use IP addresses that suit your environment. Make sure, however, that the internal cluster network (bond1 in the diagram) uses a different IP network to the “external” network (bond0 in the diagram).
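Before entering the addresses, you can verify that the two networks really are different. The following Python sketch uses the standard ipaddress module. The function name is hypothetical and the addresses in the usage note are placeholders from the documentation ranges, not values from the diagram.

```python
import ipaddress


def networks_are_separate(external_cidr, internal_cidr):
    """Return True if the bond0 and bond1 addresses are in different IP networks.

    Both arguments are interface addresses with prefix length,
    for example '192.0.2.10/24'.
    """
    external = ipaddress.ip_interface(external_cidr).network
    internal = ipaddress.ip_interface(internal_cidr).network
    # Overlapping networks would make the internal cluster traffic
    # indistinguishable from traffic on the external network.
    return not external.overlaps(internal)
```

For example, networks_are_separate("192.0.2.10/24", "198.51.100.1/24") reports separate networks, while two addresses from the same /24 do not pass the check.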

3.1. Network configuration

Open the web interface of the first node, select Device settings and Network settings at the top. You will now be on the network settings configuration page. There are two modes available to you here. The Basic mode, which you can only use to configure your device’s LAN1, is activated by default.

The Advanced mode is required for clustering. In order to activate this mode, click on the button Advanced mode at the top and confirm the security prompt.

All network interfaces available in the device will be shown to you on the following page. Only the interface eth0 (corresponding to LAN1) will currently have a configuration, which was applied by the Basic mode.

Now create the first bonding interface bond0 by clicking on Create bonding. In the dialogue that follows, enter all data as shown in the diagram below and confirm with Save.

Now create the second bonding interface bond1 with the appropriate configuration.

After you have created the two bonding interfaces, you will be able to review all settings made in the network configuration dialogue.

Once you have successfully completed all configuration steps, make the settings effective by clicking on Activate changes. The new network settings will then be loaded. After a few seconds, the network configuration will look like this:

Now repeat the network configuration, with the appropriate settings, on your second device.

3.2. Host names

Devices to be connected in a cluster must have different host names. You can specify these now in the device settings. In our example, we configure node1 as the host name on the first device and node2 on the second device.

3.3. Connecting the cluster

Having completed preparations, you can now continue setting up the cluster. To do this, open the Clustering module in the main menu of the first device (here node1) in the web interface and click on Manually set up cluster.

Now enter the appropriate configuration in the cluster creation dialogue and confirm the dialogue with Save. If you require more information about this dialogue, click on the icon beside the MK logo in the top right-hand corner. Context help will then appear in the dialogue explaining the individual options.

On the following page, you can connect the two devices to form a cluster. To do this, you need to enter the password of the web interface of the second device. This password is used once to establish the connection between the two devices. Then confirm the security prompt if you are sure that you want to overwrite the data of the target device with the IP address displayed.

Once this connection has been established, cluster setup begins. You can have the current status displayed on the cluster page.

As soon as the cluster has been successfully built, the synchronisation of monitoring data will start from the first to the second node. While this synchronisation is still taking place, all resources, including any monitoring instances you may have, will be started on the first node.

From now on you can use the cluster IP address to access the resources of the cluster (e.g. your monitoring instances), regardless of which node currently holds the resources.

4. The state of the cluster

When the first synchronisation is complete, your cluster will be fully operational. You can view the state at any time on the cluster page.

Using the status screen on the console, you can also view the current state of the cluster in the Cluster box in summarised form. The role of the respective node is shown after the current status with (M) for the master host and (S) for the slave host.

5. Special cases in the cluster

5.1. Access to resources

All requests to the monitoring instances (e.g. web interface access or requests to Livestatus) as well as incoming messages (e.g. SNMP traps or syslog messages to the event console) should normally be sent via the cluster IP address.
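As an illustration, a Livestatus request from a client would be directed at the cluster IP address rather than at a node. The following Python sketch rests on several assumptions: that unencrypted Livestatus access over TCP has been enabled for the monitoring instance, and that you know its port. The function names are hypothetical, and the address and port in the usage comment are placeholders.

```python
import socket


def build_livestatus_query(table, columns):
    """Build a Livestatus GET query that asks for JSON output."""
    lines = [
        f"GET {table}",
        "Columns: " + " ".join(columns),
        "OutputFormat: json",
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


def query_cluster(cluster_ip, port, query, timeout=5.0):
    """Send a query to the cluster IP address and return the raw response."""
    with socket.create_connection((cluster_ip, port), timeout=timeout) as conn:
        conn.sendall(query.encode("utf-8"))
        # Signal the end of the request, then read until the server closes.
        conn.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


# Usage (placeholder address and port, adapt to your environment):
# response = query_cluster("192.0.2.50", 6557,
#                          build_livestatus_query("hosts", ["name", "state"]))
```

Because the client only knows the cluster IP address, it does not need to be reconfigured when the resources move to the other node.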

Only in exceptional cases (e.g. diagnostics or updates of a particular node) should you need to access the individual nodes directly.

5.2. Device settings

The settings (e.g. time synchronisation or name resolution settings) that were previously made independently on the individual devices are synchronised between the two nodes in the cluster.

However, you can only edit these settings on the node that is currently active. The settings are locked on the inactive node.

There are some device-specific settings (e.g. those of the management interface of the Checkmk rack1) which you can adapt on the individual devices at any time.

5.3. IP addresses or host names of the nodes

To be able to edit the IP configuration of the individual nodes, you must first disband the connection between the nodes. To do this, click on Disband cluster on the cluster page. You can then adapt the desired settings via the web interface of the individual nodes.

Once you have made the adjustments, you must now select Reconnect cluster on the cluster page. If the nodes can be successfully reconnected, the cluster will resume operation after a few minutes. You can see the status on the cluster page.

5.4. Administrating Checkmk versions and monitoring instances

The monitoring instances and Checkmk versions are also synchronised between the two nodes. You can only modify these in the web interface of the active node.

If you access the cluster IP address directly for this purpose, you will always be directed to the device on which you can configure these things.

6. Administrative tasks

6.1. Firmware updates in the cluster

The firmware version of a device is not synchronised in cluster operation. The update is therefore carried out separately on each node. The advantage is that one node can continue performing the monitoring while the other node is updated.

When updating to a compatible firmware version, you should always proceed as follows:

First open the Clustering module in the web interface of the node to be updated.

Now click on the heart symbol in the column of this node and confirm the security prompt that follows. This will put the node into maintenance state.

Nodes that are in maintenance state release all resources currently active on the node, upon which the other node takes control of them.

While a node is in maintenance state, the cluster is not failsafe. So if the active node is now switched off, the inactive node in maintenance state will not take control of the resources. If you now additionally put the second node into maintenance state, all resources will be shut down. These will only be reactivated when a node is taken out of maintenance state. You must always remove the maintenance state again manually.
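The behaviour described above can be modelled in a few lines. This Python sketch is a simplified illustration, not the appliance's implementation. The state names "online", "maintenance" and "offline" and the function name are hypothetical.

```python
from typing import Dict, Optional


def resource_holder(node_states: Dict[str, str],
                    current_holder: Optional[str]) -> Optional[str]:
    """Return the node that holds the resources, or None if they are shut down.

    node_states maps a node name to 'online', 'maintenance' or 'offline'.
    """
    # The current holder keeps the resources as long as it is online.
    if current_holder and node_states.get(current_holder) == "online":
        return current_holder
    # Otherwise another node takes over, but only if it is online.
    # A node in maintenance state never takes control of the resources.
    for node, state in node_states.items():
        if state == "online":
            return node
    # No node is available: all resources are shut down.
    return None
```

With node1 in maintenance state and node2 online, node2 holds the resources. If node2 is then switched off, the function returns None, which corresponds to the loss of failsafe operation described above.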

If the cluster page shows the following, the node is in maintenance state.

You can now perform the firmware update on this node in the same way as on a standalone device.

After you have successfully performed the firmware update, open the cluster page once more and remove the maintenance state of the updated device. The device will then automatically merge into cluster operation, upon which the cluster becomes fully functional again.

It is recommended to run the same firmware version on both nodes. You should therefore repeat the same procedure for the other node next.
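The complete procedure for both nodes can be written down as an ordered checklist. The following Python sketch merely generates this list of manual steps; it does not control the appliance. The function name is hypothetical, and the step of waiting for the cluster to be fully functional before continuing with the next node is an interpretation of the text above.

```python
from typing import List


def rolling_update_plan(nodes: List[str]) -> List[str]:
    """Return the manual update steps, one node after the other."""
    plan = []
    for node in nodes:
        plan.append(f"{node}: open the Clustering module and set maintenance state")
        plan.append(f"{node}: perform the firmware update")
        plan.append(f"{node}: remove maintenance state")
        # Assumption: continue only once the cluster is failsafe again.
        plan.append(f"{node}: wait until the cluster is fully functional")
    return plan


for step in rolling_update_plan(["node1", "node2"]):
    print(step)
```

Updating the nodes strictly one after the other ensures that one node is always available to perform the monitoring.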

6.2. Disbanding clusters

It is possible to remove the nodes from a cluster and continue running them separately. When doing so, you can continue using the synchronised configuration on both devices or, for example, reset one of the devices to factory settings and reconfigure it.

You can remove one or both nodes from the cluster during operation. If you wish to use both nodes, you must ensure that the data synchronisation is in good working order beforehand. You can verify this on the cluster page.

In order to disband a cluster, click on Disband cluster on the cluster page of the web interface. Read the text of the confirmation prompt that follows. Depending on the situation, this text explains which state the respective device will be in after the disconnection.

The disconnection of the devices must be carried out on both nodes separately, so that both devices can be run separately in future.

If you only wish to use one of the devices in future, disconnect the cluster on the device you intend to continue using and then restore the factory settings on the other device.

Once you have disconnected a node from the cluster, the monitoring instances will not be started automatically. If you wish to start the monitoring instances, you need to do so next via the web interface.

6.3. Exchanging a device

If the hard drives of the old device are in good order, you can take them from the old device and insert them into the new device. Wire the new device in exactly the same way as the old device was wired and then switch it on. After starting, the new device will merge into the cluster in the same way as the old device.

If you want to completely replace an old device with a new one, you should proceed in the same way as when disbanding the cluster completely (see previous chapter). To do this, select one of the previous devices, disconnect this device from the cluster and create a new cluster with this device and the new device.

7. Diagnostics and troubleshooting

7.1. Logging

Cluster administration is largely automatic: processes on the nodes decide on which device which resources are started and stopped. These decisions are recorded in detailed log entries. You can access these entries from the cluster page by pressing the button Cluster log.

Please note that these entries, just like the other system messages, are lost when the device is restarted. If you would like to keep the messages for longer, you can download the current log file via your browser or set up permanent forwarding of log messages to a syslog server.