Ep. 8 (part I): Working with rules & setting thresholds

Note: All the videos on our website offered in the German language have English subtitles and transcripts, as given below.

[0:00:00] Welcome back to the Checkmk Channel.
[0:00:03] Today we want to tackle a somewhat more complex topic, namely how you can parameterize your hosts and services, for example how you can set threshold values.
[0:00:11] Here Checkmk has a very special system, which also distinguishes it from the other products, namely that its configuration is rule-based, and because, as I said, this topic is a bit more extensive, we have made three videos to cover it.
[0:00:24] In the first video we will show you how to work with rules, and how to set threshold values. In the second part we will deal with the so-called Host Tags. These are the characteristics of hosts that you can use to build the rules even more intelligently. In the third part we will look at how you can manage hosts in folders, which also helps to make the configuration a little more elegant.
[0:00:51] Working with rules and setting thresholds. So today we will look at how you can set the parameters for services.
[0:01:01] What are parameters?
[0:01:02] First, the most important question is: when should the service actually go to WARN or CRIT, that is, when do you actually want to identify a problem? Checkmk has very reasonable default settings, but there will always be individual cases where you want to change the defaults, for example for a file system: at what point should the service go to WARNING? There are a couple of other parameters that can also be set, for example how often a service should be checked, that is, a check every how many minutes.
[0:01:27] The default value is 1 minute.
[0:01:30] Also for example, in which time window should the alarm be given? These are the questions.
[0:01:33] We will first deal with the threshold values - the question of when a service should go to WARN or CRIT. And to do this, let's just take as an example the CPU-Load service on our monitoring server itself.
[0:01:45] So, I'm back on the small monitoring system that we built in the previous episodes, I then go to these two hosts, and there find my monitoring server, and here is the service CPU load, which measures the current CPU load.
[0:02:04] Load is not the same as utilization, so it is not a percentage utilization. I will say a few more words about this later. And here I would like to define the threshold values at which this service should go to WARN and CRIT.
[0:02:17] There are various ways to do this. The simplest is that I go to this menu, and then to the 'Parameters for this service', where I come to a single list of all of the possible parameters.
[0:02:29] The first box is interesting, and here you can see the second line CPU load (not utilization).
[0:02:38] This is clickable. You can also see here that it is currently on the Default Value, namely 'WARNING at 5', a load of 5 per core, and CRITICAL at 10 per core, so relatively high values.
[0:02:51] To change all of this, I go here to CPU load, and come to a so-called rule set.
[0:02:59] This rule set is now empty, which means there is no rule for CPU load yet, so I can either create a rule specifically for this host, or I can create a rule that applies for all hosts in general.
[0:03:13] I can now take the step where I create a rule only for the host 'mycmkserver'.
[0:03:21] Each new rule has general properties at first, these are the Rule Properties. Here is a description - a comment – these are all optional fields.
[0:03:31] But I can simply enter 'CPU load on the monitoring server' here if I want to, and then just close the panel.
[0:03:43] The most important thing now is that every rule always has a condition and a value.
[0:03:48] Let's start with the value. It specifies the threshold values. You can see here that WARN is set to 5 by default and CRIT to 10.
[0:03:57] Now we just make a test, in which I say with a load of 1 I would like to have a WARNING, and from 3 per CPU core it should be CRITICAL.
[0:04:06] That means per single core. If you have a system with 4 cores, the warning would then be from 4. These conditions initially look very complex.
[0:04:19] What is already pre-filled here is that this rule only applies to the host mycmkserver. No other conditions have been set yet, and I'll just leave that for now and simply save it.
[0:04:31] So, we now have a change. You already know about that from Activate Changes. I go to this single change here and activate it, and from that moment this change, and thus the new thresholds, become active.
[0:04:44] As a check, I do the same as before: I go to this server, to the CPU load, and to the parameters, and now you can see that something has changed. Rule 1 is now here in the Main directory, so it is the first rule in this main directory, and here is the threshold value that has now come out. This is exactly what I wanted: from 1 it goes to WARN and from 3 it goes to CRIT. Now one can of course ask: how can I produce this condition?
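The per-core thresholds set in this rule (WARN from 1, CRIT from 3) can be pictured with a small Python sketch. This is only an illustration of the arithmetic described in the video, not Checkmk's own code; the function name `load_state` and the choice of "greater than or equal" for the comparison are assumptions.

```python
def load_state(load: float, cores: int,
               warn_per_core: float = 1.0,
               crit_per_core: float = 3.0) -> str:
    """Classify a load value against thresholds that are given per CPU core.

    Hypothetical helper for illustration; the >= comparison is an assumption.
    """
    warn = warn_per_core * cores  # effective WARN threshold for the whole system
    crit = crit_per_core * cores  # effective CRIT threshold for the whole system
    if load >= crit:
        return "CRIT"
    if load >= warn:
        return "WARN"
    return "OK"


print(load_state(1.7, cores=1))  # single core: WARN threshold is 1
print(load_state(1.7, cores=4))  # four cores: WARN threshold is 4
```

With one core, a load of 1.7 lies between the two thresholds; with four cores the same load stays below the scaled warning threshold of 4.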
[0:05:17] I just perform a little trick, and I go to the console.
[0:05:21] I am now logged on to my monitoring server and will now do something that you should probably only do if you know what you are doing. Namely, I'm creating an endless loop in the shell here. This means that a process will now calculate continuously, and the CPU load increases by exactly one. The load is namely the number of processes that are currently calculating, and so that I can get them up a bit, I just do a few more, and now have an incredible number of processes that run and calculate in the background.
[0:05:50] I can look at it with top, and you can see that there are a lot of shell processes running, and up here you can already see that this load average is just starting to increase.
[0:05:59] Here it goes from 1.9 to 2.6 - these are average values, so it takes some time until they run up. By the way, here is the average over the last minute, over the last five minutes, and over the last 15 minutes.
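The three averages that top displays can also be read programmatically. The following is a minimal sketch using Python's standard library on a Unix-like system; the function name `read_load_averages` is made up for this example, and dividing by the core count is shown only to relate the numbers to the per-core thresholds discussed above.

```python
import os


def read_load_averages() -> dict:
    """Return the 1, 5 and 15 minute load averages (Unix-like systems only)."""
    one, five, fifteen = os.getloadavg()
    return {"1min": one, "5min": five, "15min": fifteen}


loads = read_load_averages()
cores = os.cpu_count() or 1  # fall back to 1 if the core count is unknown

for window, value in loads.items():
    # Load per core is what a per-core threshold would be compared against.
    print(f"{window}: load {value:.2f}, per core {value / cores:.2f}")
```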
[0:06:13] I now return to the monitoring system and take a look at this CPU-load service. It is still OK, since it only runs once per minute. What I can do to speed it up, if I am impatient, is to go back to this menu and say 'Reschedule Check_MK service', so that this service will also be recalculated.
[0:06:37] So, if all goes well, it should go to WARNING, or maybe even to CRIT, depending on how much the CPU load has increased.
[0:06:50] As we can see it is now on WARNING. We have a 15-minute CPU-load of 1.7.
[0:06:57] I can again go into the details for this service, and when I look at the graph, I can see here that the CPU load is slowly starting to increase.
[0:07:10] By the way, you can also see the threshold values now plotted in the graph – which is actually very practical – that you can now see that 1 and 3 are the threshold values for WARN and CRIT respectively.
[0:07:21] So, it behaves in exactly the way we want it to, and so we are done with it at this point.
[0:07:28] So that my poor monitoring server can catch its breath again, I should of course now end all of these test processes. I'll do that quickly by stopping top, bringing each process back to the foreground with fg, and ending it with CTRL+C, and then it should be - good.
[0:07:57] That was the last one, I’ll go back to top – the processes have disappeared, and the load will slowly go all the way back down.
[0:08:06] Let's now make a second rule, this time one for a filesystem. Here there is a small difference, namely the CPU-load is a thing that only exists once per server.
[0:08:17] In the case of file systems, however, there can of course be several file systems on one computer, which is why I usually have to specify which of the file systems I mean - if it is supposed to be a particular one.
[0:08:29] And now we do that using the example of the root file system on the monitoring server.
[0:08:34] I now go back to my monitoring server, and this time I take the service Filesystem / – in this case the root file system on the monitoring server.
[0:08:44] Again I go the same way. I go to the Parameters for this service, and now you can see up here that this second line is already incredibly complex.
[0:08:54] Filesystems (used space and growth) is the name of the rule set with which you define threshold values for file systems.
[0:09:02] If I go to that now, I will see the same thing as before, but there is however a little specialty: here it says 'Create Mount points specific rule for'.
[0:09:12] Now what does that mean?
[0:09:15] This rule would apply to the host mycmkserver and to the service that monitors the file system with the mount point /. Under Unix and Linux, file systems are distinguished by their mount point, and the root file system is the /, which means that if I create a rule here, it will apply only to this one service.
[0:09:39] So, I'm going to click on 'Create Mount point specific rule for'.
[0:09:46] And now I see the same thing again: I have the rule properties. I leave these empty for now because they are optional, and I just have a value and a condition.
[0:09:54] Let's look at the conditions first.
[0:09:57] The condition now contains Explicit hosts, so it should only apply explicitly to this host.
[0:10:04] And here Mount Point is ticked, only for this Mount Point. The special thing is now that here / is followed by $.
[0:10:11] Why is that the case?
[0:10:13] These are so-called regular expressions. We may make a separate video about regular expressions, and there is also a description in the manual. Usually it is enough to know that the expression that you enter here matches the beginning of the service name.
[0:10:30] If I omitted this $ sign, it would apply to all mount points that begin with /, in this case these are all existing mount points, so it would make little sense.
[0:10:39] The dollar sign stands for the end of this service name. That means this rule will only apply to the file system called exactly /.
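The effect of the trailing dollar sign can be tried out with Python's `re` module, whose `re.match` also anchors only at the beginning of the string. This is a sketch of the prefix-matching behaviour as described in the video, not Checkmk's implementation; the helper name `rule_matches` and the list of mount points are invented for the example.

```python
import re


def rule_matches(pattern: str, item: str) -> bool:
    """True if the pattern matches from the beginning of the item (prefix match)."""
    return re.match(pattern, item) is not None


mount_points = ["/", "/var", "/home"]  # hypothetical mount points

# Without "$" the pattern "/" matches every mount point that begins with "/".
print([m for m in mount_points if rule_matches("/", m)])   # ['/', '/var', '/home']

# With "$" the match must also end there, so only "/" itself is left.
print([m for m in mount_points if rule_matches("/$", m)])  # ['/']
```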
[0:10:50] Let's take a look at the value. Please do not be alarmed, there are an incredible number of options for monitoring file systems in Checkmk. This is a very complex, powerful check, and there are a lot of things you can do with it, up to Trend computation, where you can see, for example, whether the file system is growing over time.
[0:11:13] In most cases, you just need this first point, which I tick now and you can then use it to set threshold values.
[0:11:23] The easiest thing of all is simply 'Levels for filesystem used space'.
[0:11:27] This means the threshold values for the percentage of occupied space.
[0:11:31] And here 80% and 90% are suggested, which are also the default threshold values if you have not set anything.
[0:11:36] Conversely, you could say that you want the threshold values to be free space, then you would choose the second option.
[0:11:42] I'm going to stay with the first option, and will set a very, very low value, so that I will simply be able to see if the whole thing goes to WARNING.
[0:11:50] In this case, for example, I set WARNING at 5% and CRITICAL at 10%. Of course, these thresholds are very low, but I am only doing it to try the function out.
[0:12:03] I save it, activate the change and when the process has gone through, I can choose this service again.
[0:12:12] I do this with Quicksearch: Filesystem /. Since I only have one host in it, I find only one service here.
[0:12:20] I am again impatient and am doing a reschedule here, so that the next check is carried out and, as you can see, it immediately turns CRITICAL.
[0:12:27] Incidentally, this service also shows that the threshold values here are 5% and 10% respectively.
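As a rough illustration of 'Levels for filesystem used space', the sketch below computes the used percentage of a file system and compares it with a WARN/CRIT pair. It is a simplified model and not how the Checkmk check is implemented: `shutil.disk_usage` does not account for reserved blocks, so the figure can differ from what a monitoring agent reports, and the function names and the "greater than or equal" comparison are assumptions.

```python
import shutil


def used_percent(path: str = "/") -> float:
    """Percentage of used space on the file system containing `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against pseudo file systems with size 0
        return 0.0
    return 100.0 * usage.used / usage.total


def fs_state(percent: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Classify a used-space percentage; the defaults mirror the suggested 80%/90%."""
    if percent >= crit:
        return "CRIT"
    if percent >= warn:
        return "WARN"
    return "OK"


current = used_percent("/")
# The deliberately low test thresholds from the video: WARN at 5%, CRIT at 10%.
print(f"/ is {current:.1f}% full -> {fs_state(current, warn=5.0, crit=10.0)}")
```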
[0:12:35] Well, let's make a third rule. This time it's about network interfaces, network cards on servers and switch ports, or the network connections of routers and switches.
[0:12:45] In Checkmk they go through the same check. This check is called Interface, and the service name comes from the name of the interface or the port. And now I want to create a rule that applies not only to one service, but to several.
[0:12:58] And I will show you how that works. It is really not that difficult either.
[0:13:02] I will start by going to my switch, which I can see has a lot of ports. We just take any one of them, for example the 17th, and go back to the 'Parameters for this service'.
[0:13:15] This time the rule is called 'Network Interfaces and switch ports'. I'll click here and come back to the page you already know.
[0:13:24] Here it is called 'Create ports specific rule for', a rule only for a specific switch port. When I now create the rule here, you can also see below that the host name has already been filled in and that port 17 is entered here, again with the dollar sign at the end so that it doesn't just apply to the prefix.
[0:13:49] So now I can simply take the cross out.
[0:13:51] Now it applies to all interfaces of switch1. I can also remove this other cross, which means that it will then apply to all switches and all ports, on all hosts.
[0:14:04] With the Value you can see here again that this whole thing is extremely complex. You can do a lot there. I will now just do the Used bandwidth, which means that I want to set threshold values for when a certain volume of traffic runs on the port.
[0:14:21] And here you can also see that I can make several entries and specify for each one whether I only want the incoming or the outgoing traffic, for example. I will take the outgoing now. And you can not only set an upper limit, for example, but also a lower limit, so that you will also get a warning when there is no more traffic on this port at all.
[0:14:41] I will just set percentage threshold values of 10% and 20% of the bandwidth. You can also set threshold values in bits per second, and many other things as well.
[0:14:55] So if I save here again and activate the change, I will have defined these threshold values for all switch ports.
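To make the percentage thresholds more tangible, the following sketch converts them into absolute bit rates for a port of a given speed and classifies a measured outgoing rate. The port speed of 1 Gbit/s, the function names, and the "greater than or equal" comparison are assumptions for illustration only; they are not taken from the video or from Checkmk.

```python
def bandwidth_levels(speed_bps: float, warn_pct: float, crit_pct: float) -> tuple:
    """Convert percentage thresholds into absolute thresholds in bits per second."""
    return speed_bps * warn_pct / 100, speed_bps * crit_pct / 100


def traffic_state(out_bps: float, speed_bps: float,
                  warn_pct: float = 10.0, crit_pct: float = 20.0) -> str:
    """Classify outgoing traffic against upper thresholds given in percent."""
    warn, crit = bandwidth_levels(speed_bps, warn_pct, crit_pct)
    if out_bps >= crit:
        return "CRIT"
    if out_bps >= warn:
        return "WARN"
    return "OK"


GIGABIT = 1_000_000_000  # assumed port speed: 1 Gbit/s

print(bandwidth_levels(GIGABIT, 10.0, 20.0))  # WARN and CRIT in bits per second
print(traffic_state(150_000_000, GIGABIT))    # 15% of the assumed port speed
```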
[0:15:07] What actually happens when you create two rules for a service?
[0:15:09] Which one now applies? There is a very simple principle in Checkmk: The first rule, the conditions of which are met, applies and sets the value.
[0:15:17] If more come afterwards, they will simply be ignored, and this principle can be used to form a concept of general principles and exceptions. I can actually create two rules: one for the general value and one for specific exceptions.
[0:15:32] I just have to make sure that this rule comes before the others. Let's try an example: I will go to my switch and make a rule for a single specific port.
[0:15:44] So, let's say for example that I want to set something different for port 18 than for the other ports. So, here again I return to the rule set.
[0:15:53] As you can see, there is already a rule that we created earlier, and I am now creating another rule. I leave port 18 and switch1 in the conditions, which means that it should only apply to this one port.
[0:16:11] And now, for example, I set a Used Bandwidth that differs from the other one by setting upper thresholds of 80% and 90%.
[0:16:23] When I now save this rule, you will see that we now have two rules. These colored dots here show you for this host switch1 and this port 18 – where we now are – which rule applies.
[0:16:37] For example, we can see that this first rule applies, the condition has been met, and it defines parameters.
[0:16:44] The second rule is yellow, which means that the condition would be met, but the parameters are already defined by the first rule. It is therefore ineffective. That is of course pretty senseless, because this should be the exception, and the exception should of course always take precedence over the general rule.
[0:16:58] So, I'm going to take this crosshair here and move the rule to the first position. And now you can see that the top rule is green again, but now it is the other rule: first comes the exception, which now defines the value here, so that the general rule does not apply in this case.
[0:17:16] I go back to a switch and look at some other port, for example interface 5, go back to the rule set.
[0:17:27] Now you can see that rule two applies to this port. If that interests me, I can go back to the rule set and see that the first rule is now grey, which is also obvious, because not every condition is met: I am at port 5 and this rule only applies to port 18.
[0:17:45] That means it falls through to the second rule, which is precisely this general rule that sets my general threshold values.
[0:17:51] And so I can easily get a grip on even complex environments, since I only have to explicitly define the exceptions, and can define default values for all of the other settings.
[0:18:01] There is another way to the rules that I would like to show you now, and this one does not go in via a specific host, but instead via a configuration module, which lists all of the rule sets that are available.
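The principle described here, that the first rule whose conditions are met sets the value, can be sketched in a few lines of Python. This is a simplified model of the behaviour shown in the video and not Checkmk's actual rule engine; the `Rule` class, its fields, and the function `effective_value` are invented for the illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    value: Tuple[float, float]          # (WARN, CRIT) in percent
    host: Optional[str] = None          # None means "any host"
    item_pattern: Optional[str] = None  # None means "any port"

    def matches(self, host: str, item: str) -> bool:
        if self.host is not None and self.host != host:
            return False
        if self.item_pattern is not None and re.match(self.item_pattern, item) is None:
            return False
        return True


def effective_value(rules: List[Rule], host: str, item: str,
                    default: Tuple[float, float]) -> Tuple[float, float]:
    """Return the value of the first matching rule; later rules are ignored."""
    for rule in rules:
        if rule.matches(host, item):
            return rule.value
    return default


exception = Rule(value=(80.0, 90.0), host="switch1", item_pattern="18$")
general = Rule(value=(10.0, 20.0))  # no conditions: applies everywhere
no_rule_default = (0.0, 0.0)        # placeholder default, assumed for the example

rules = [exception, general]        # exception first, general rule second
print(effective_value(rules, "switch1", "18", no_rule_default))  # (80.0, 90.0)
print(effective_value(rules, "switch1", "5", no_rule_default))   # (10.0, 20.0)

# With the order reversed, the general rule wins and the exception is ineffective.
print(effective_value([general, exception], "switch1", "18", no_rule_default))  # (10.0, 20.0)
```

The last call corresponds to the situation before the rule was moved to the first position: the condition of the exception would be met, but the value has already been set by the rule above it.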
[0:18:14] You can find it here at Host & Service Parameters.
[0:18:17] First of all, there is a division into many different categories, and then there are rule sets in each category.
[0:18:22] What I want to look at now is in the Monitoring Configuration. There you will find a rule set called 'Host Check Command'. I would like to find a solution for the following case: let's say you want to check a host, or a firewall, something like that, but you can't ping it. So I would now like to say that the check which checks the host, which would normally be a ping, will be changed into a TCP connect to port 80.
[0:18:51] For this I go into this host check command rule set.
[0:18:56] You can already find a predefined rule here that deals with Docker but I don't want to go into that in more detail now - I'm just going to create a rule here, and I'll leave the Conditions empty for now.
[0:19:08] You could now explicitly enter certain hosts, as you already know, or we will do it with the Host Tags, which I will describe in the next episode. But now I will specify as a Host Check Command that I do not want to have a ping, but rather a TCP connection test on port 80. That is, in order to test whether the host is alive, an attempt is made to connect to port 80, in other words to the HTTP service which may be running on this host. Here too I can save the rule.
[0:19:35] And the rest is actually as usual.
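What a TCP connection test on port 80 amounts to can be shown with Python's `socket` module. This is only a sketch of the idea (the host counts as alive if a TCP connection can be opened), not the check that Checkmk itself runs; the function name `host_is_up` and the timeout of 3 seconds are assumptions.

```python
import socket


def host_is_up(address: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to address:port can be established."""
    try:
        # create_connection performs the TCP handshake; the socket is closed
        # again as soon as the with-block is left.
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A call such as `host_is_up("firewall.example.com")` (a made-up host name) would then stand in for the ping. Note that this only works as a liveness test if something on the host actually listens on the chosen port.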
[0:19:39] In general, you can say that everything for hosts and services that is configured in Checkmk is based on rules, and you have now seen how you can set this up.
[0:19:49] In the second part of this episode we will deal with the so-called Host Tags, with which it is very easy to formulate conditions, for example, that a rule should apply to all of my production Linux servers.
[0:20:01] Stay tuned, we'll see you soon for the second part.
