Ep. 30: Clustering the Checkmk appliance

[0:00:00]	Welcome to the Checkmk channel. Today, I'm going to show you how to cluster the Checkmk appliance.
[0:00:15]	If you have one of our hardware appliances, you want to make sure you're resilient against hardware failure. To do so, we deliver the possibility to cluster two Checkmk appliances, so if one of those fails, the other one takes over.
[0:00:29]	A word of warning at the beginning: in my demo here, you will see that I'm using virtual appliances. That is not something you want to do in production.
[0:00:39]	In production, the clustering feature is aimed at hardware appliances, to make sure you are resilient against hardware failure, and not at virtual environments. If you are running our appliance virtually, then the hypervisor will take care of the high availability for you.
[0:00:54]	So, without further ado, let's dive into the configuration. Okay, so first, we are going to create a network bond with the four available network interfaces.
[0:01:04]	To do so, we navigate to the Device Settings, Network Settings, and there we need to enable the advanced mode. Because here you can see we can only set one IP configuration, which makes sense if you're running in standalone mode or in a virtual machine.
[0:01:20]	But in this case we want to configure a bonding interface. So, we go to the advanced mode. We have to say that we really want to go there.
[0:01:32]	And now we can see all the four interfaces that we can use for configuration. Now to create our first bond, I'm going to click on Create Bonding.
[0:01:42]	We can use the default name, which makes sense. Then we want to use the first two interfaces as members of that bond. The default settings for Bonding-Mode and MAC address failover are fine here.
[0:01:54]	And now we want to set an IP configuration. So, I'm just going to use the IP configuration that you just saw, because this is my first bond and I still want to access the web interface here.
[0:02:09]	And then we are good to go. So, I'm going to save these settings. And now you can see the configuration change was made, but the changes haven't been activated yet. So, I'm just going to go ahead and create a second bond with the remaining two interfaces.
[0:02:28]	The defaults are fine. We just need to add a simple IP configuration here. And this time with a different network. This one's important for clustering but I'm going to get to that later. So, let me just save this configuration too.
[0:02:43]	And now we can see all our network interfaces are members of a bond. We have two bonds here. Nothing is active at this point because we need to activate changes.
[0:02:54]	So, I'm going to do this right now. And then we need to move over to the second appliance and do the very same changes there too.
[0:03:05]	Okay, so as mentioned, now we are on our second appliance and there we need to configure the very same settings too, with different IP addresses of course.
[0:03:13]	But the settings all in all are the same because we want to have the same network configuration on both devices. So, again we have the same mode, the same configuration, all in all, the only difference here is the IP address that we are using.
[0:03:39]	And I'm just going to save that onto the second bond. You can see I'm a lazy person, so, I prepared everything, I just have to fill in the gaps. And that's that. So, we can activate the changes on the second device too. And then we can go back to the first device and start creating the cluster.
[0:04:04]	Okay, so after a few moments, we can see the bonds are active, up and running. Everything's looking good. So, we can head over to the cluster configuration. For that, we go back to the Main Menu and there you can see the Clustering menu here. So, we're going to open that.
[0:04:22]	We're going to click on create a cluster with a non-discovered device. Okay, so first, we need the IP address of the partner. That's the 202 IP address you saw earlier.
[0:04:33]	The Data Sync Interface can stay at bond one. This is actually why we are using bonding interfaces with the hardware because in case a network link fails physically, then the bond will make sure all the traffic goes over the other interface and the communication is still possible.
[0:04:52]	Next, we have the Cluster Communication Interfaces. Those are the interfaces that are generally available for communication between the cluster nodes.
[0:04:59]	So, we want both bond interfaces there. Then we have to give the cluster an IP configuration, that is, the cluster address under which the cluster will be available.
[0:05:12]	Of course, we need a Netmask here too. And for this communication, we will take bond0. As you can see, the IP address here is in the 56 network, and that was on bond0, so this is the configuration that we want to have here.
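The rule of thumb from this step (the cluster address has to sit in the network of the interface it is bound to) can be checked with a few lines of Python. This is only a sketch for illustration: the helper `same_network` and all addresses below are hypothetical, since the video only mentions "the 56 network" and does not spell out the full addresses.

```python
import ipaddress


def same_network(interface_cidr: str, candidate_ip: str) -> bool:
    """Return True if candidate_ip lies in the network of the given interface."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(candidate_ip) in iface.network


# Hypothetical example values, not taken from the video.
bond0 = "192.168.56.201/24"
bond1 = "10.3.3.201/24"
cluster_ip = "192.168.56.210"

print(same_network(bond0, cluster_ip))  # True
print(same_network(bond1, cluster_ip))  # False
```

With these example values, the cluster address matches bond0 and not bond1, which is why bond0 would be the interface to pick.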
[0:05:26]	And then we need the ping targets. So, before I enter anything here, let me just quickly explain what it does. The ping targets help the cluster to decide whether a node has been isolated from the other node.
[0:05:38]	So, both nodes try to reach those ping targets, and if a node can no longer reach them, that side of the cluster knows that it's isolated from the network and can take appropriate action.
[0:05:50]	So, if it's the standby side, that doesn't make any difference: it goes offline, but the primary side keeps running. If the primary side realizes that it has been isolated, then it will stop operating and the second node will take over.
[0:06:04]	So, in general you want these ping targets to be some highly available IP addresses within your network, like a core switch, a domain controller, something like that. But it shouldn't be the gateway of the IP network that we are using here, because both appliances will be able to ping this interface all the time.
[0:06:23]	So, I'm using it actually because in my demo environment, I do not have too many IP addresses to choose from. But make sure that in a production environment you give at least two or three highly available IP addresses, so that the appliances can decide whether the cluster can keep working or whether a failover has to take place.
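The ping-target behavior described above can be written down as a small decision function. This is a simplified model of the explanation in the video, not the appliance's actual implementation: the names `Role`, `is_isolated` and `node_action` are made up for this sketch, and it assumes a node counts as isolated only when none of its ping targets answer.

```python
from enum import Enum
from typing import Dict


class Role(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"


def is_isolated(ping_results: Dict[str, bool]) -> bool:
    """Assumption: a node is isolated when no ping target is reachable."""
    return not any(ping_results.values())


def node_action(role: Role, ping_results: Dict[str, bool]) -> str:
    """Return what a node does, following the description in the video."""
    if not is_isolated(ping_results):
        return "keep current role"
    if role is Role.STANDBY:
        # An isolated standby goes offline; the primary keeps running.
        return "go offline"
    # An isolated primary stops so that the other node can take over.
    return "stop and let the standby take over"


# Hypothetical ping targets, e.g. a core switch and a domain controller.
all_down = {"10.0.0.1": False, "10.0.0.2": False}
one_up = {"10.0.0.1": True, "10.0.0.2": False}

print(node_action(Role.PRIMARY, one_up))
print(node_action(Role.PRIMARY, all_down))
print(node_action(Role.STANDBY, all_down))
```

Under this model, a single ping target makes the decision fragile: if that one device fails, both nodes look isolated at the same time. That is one way to read the advice to configure two or three targets.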
[0:06:42]	Okay, that being said, the configuration is done. We can save the configuration. This takes a few moments.
[0:06:50]	Now we need to log into the remote appliance because, of course, we need to authenticate ourselves that we are allowed to overwrite the configuration of the second appliance.
[0:07:02]	That's what we do with the password here. Then we click on Connect. And after we click on Connect, we are asked one more time if we really want to synchronize with the second appliance. Because it will overwrite all the data that is there, we want to check the IP addresses to make sure it's really the right appliance. I'm going to confirm that.
[0:07:27]	And then it takes a few seconds for the two appliances to communicate with each other to start synchronizing their state. So, if you click on Back, in the first moments, we will not see any information on the cluster state.
[0:07:42]	We just get the IP address of the partner and the cluster IP address. That's quite obvious information here. And if I refresh the page, then we see the cluster is starting to be created.
[0:07:54]	At this point, the communication hasn't started yet, so let me just refresh one more time. And now we can see both nodes are shown as online. The network reachability is fine.
[0:08:08]	At this point, File Synchronization reports unknown, synchronization is inactive. And if we update this page one more time, now we can see everything is working.
[0:08:24]	One side of the cluster is already active and we could start creating our first site and start monitoring. But for good measure, we are going to wait for the synchronization process, which, as we see here, is still synchronizing.
[0:08:36]	And it takes about 1 hour in this environment. How long it takes to synchronize depends on your environment. And after that's done, everything will be green and we can start creating our first site. Okay, so now you can see the cluster status is completely green.
[0:08:57]	The data synchronization has finished and the cluster is up and running, so we could start creating our monitoring site and start monitoring.
[0:09:07]	So, that concludes today's video. Thanks for watching. Make sure to subscribe and see you next time.