Ep. 7 (part 1): Working with rules and setting thresholds in Checkmk

[0:00:00] Welcome back to the Checkmk channel. In this episode, we are covering a slightly more complex topic, namely how you can parameterize your hosts and services, for example by setting thresholds, and how you can do this with one of the features that distinguishes Checkmk from other products: the rule-based configuration.
[0:00:22] Because this episode is a bit more extensive, we will split it up into three parts. In this first part, I will show you how to work with rules and how to set thresholds. In the second part, we will take a look at host tags and how you can use those to create more complex rules. And in the third part, we are looking at folders and how you can use them to group hosts together and create a more elegant and clearly structured setup.
[0:01:01] In this episode, we'll take a look at how to set up parameters for services.
[0:01:05] What are these parameters? 
A parameter is, for example, the threshold that we use to indicate when a service should go to the Warning or the Critical state. Checkmk has many of these thresholds pre-configured for you, but often you want to modify them to better fit your installation or your environment.
[0:01:27] Take for example the file system check, where sometimes you might want the service to go to 'Warning' when 80% of the available storage space is used, and in other cases you might want to set that threshold to 95%.
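As a rough illustration of what such a threshold pair does, here is a minimal Python sketch. The function name `state_for`, the critical levels of 90% and 98%, and the choice to treat levels as "at or above" are assumptions made for the example; this is not Checkmk's own code.

```python
def state_for(value, warn, crit):
    """Map a measured value to a state using upper thresholds.

    Hypothetical helper for illustration only. Levels are treated as
    inclusive ("at or above"), which is an assumption.
    """
    if value >= crit:
        return "CRIT"
    if value >= warn:
        return "WARN"
    return "OK"


# A file system that is 85% full, checked against two different level pairs.
print(state_for(85.0, warn=80.0, crit=90.0))  # stricter levels
print(state_for(85.0, warn=95.0, crit=98.0))  # more relaxed levels
```

With the first pair the same 85% usage counts as a Warning; with the second pair the service stays OK.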
[0:01:39] But there are more parameters, for example how often Checkmk should check the state of a certain service, or in which time window you want notifications to be sent out. But first, let's take a look at how to set up thresholds, that is, when a service should go to Warning or Critical. To do this, I will use the CPU load as an example. Let's create a rule specific for this host.
[0:02:05] Every rule has this general section with a description and a comment. These fields are all optional, so you don't need to fill them in. I will give my rule a name: "CPU load of monitoring server".
[0:02:26] Next up, every rule has a value and conditions. Let's start with the value. These are the actual thresholds for when a service should go to Warning or Critical.
[0:02:38] In this case it's still the default value, so 5 and 10. But let's change that: I want a Warning when there is one process running per core, and it should go to Critical when there are three.
[0:02:57] Next up, Conditions. At first glance, this might seem a bit complex, but it's easier than it looks. Right now Checkmk has pre-filled the explicit host, because we created a rule explicitly for this host. I'm not going to change that; I'm just going to save it. And of course activate the change.
[0:03:30] Now let's see if it worked. Let's go back to all hosts, to our monitoring server, and once again we'll open the parameters for the service.
[0:03:44] You see here that rule 1 in the main directory is applied to this service, and here you see the values that we set. But how do we know that what we configured is working? To do that I will just do a simple trick on the command line. I will create an endless loop: while true; do true; done &
[0:04:09] This will make sure that one process is running continuously, and this will increase the load by exactly one. The load is the number of processes that are currently active on a core. If I repeat this a few times and look at top, you'll see that there are now quite a few of these shell processes running and the load average is slowly rising. It will take some time to rise. This first number is the load average of the last minute.
[0:04:51] The second number is the average of the last five minutes and the last number is the average of the last 15 minutes. If I now go back to the UI you see the service is still okay.
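The three numbers shown by top are also exposed on Linux in /proc/loadavg as the first three fields of a single line. The following Python sketch parses such a line; the function name `parse_loadavg` is hypothetical and the sample values are made up for illustration.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from a
    /proc/loadavg-style line (only the first three fields are used)."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return {"1min": one, "5min": five, "15min": fifteen}


# Sample line in the Linux /proc/loadavg format; the numbers are invented.
sample = "1.85 0.97 0.41 3/512 4242"
print(parse_loadavg(sample))
```

A one-minute value well above the fifteen-minute value, as in this sample, is what you would expect shortly after starting the busy loops. On Unix systems, Python's `os.getloadavg()` returns the same three averages as a tuple.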
[0:05:05] That's because of the check interval, but we can reschedule the Checkmk service, and then our service should go to 'Warning'. And now you see that the state goes to WARN. If I click on the name of the service we go to the details, and you can see here in the graph that the CPU load has started rising.
[0:05:28] You can also see the thresholds. The yellow line indicates the warning threshold and the red line the critical threshold. Here you see two and six, but we configured one and three. Why is that? Because we configured the thresholds per core and this VM has two cores, so all the values are multiplied by 2.
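The per-core arithmetic can be written out as a short Python sketch. The helper name `absolute_levels` is an assumption for illustration; the numbers are the ones from the video.

```python
def absolute_levels(warn_per_core, crit_per_core, cores):
    """Scale per-core load levels to the absolute values drawn in the graph."""
    return warn_per_core * cores, crit_per_core * cores


# Levels of 1 and 3 per core on a VM with two cores.
warn, crit = absolute_levels(1, 3, cores=2)
print(warn, crit)  # prints: 2 6
```

The same rule applied to a host with eight cores would therefore draw its lines at 8 and 24 without any change to the rule itself.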
[0:05:49] Let's reduce the CPU load again. So we are stopping the processes that we created.
[0:05:56] We simply do that by using "fg" to get the process to the foreground and now we just stop them by pressing Ctrl+C. And let's repeat that.
[0:06:19] Okay now let's check "top", just to make sure that nothing is running.
[0:06:23] Okay, all the processes have been stopped now and the CPU load should slowly go down. Now we're going to create a second rule, but this time for a file system. The difference is that there is only one CPU load service on a host, but for a file system there can be multiple.
[0:06:43] And that's why when you create a rule you need to specify exactly for which file system you are creating a rule. For this example, we'll be creating a rule for the root file system of our monitoring server. So once again let's go to the services of the Checkmk server host.
[0:07:02] This time we'll take a look at the file system service so that's the root file system of this server. Once again we go to parameters for the service. Here you can already see that this rule-set is a bit more complicated than the last one.
[0:07:22] The rule set is called: Filesystems (used space and growth), and this is where we set the parameters for the file system. So let's click on it. Now we get to a very similar view as before, the only change is that the button now says "Create mount point specific rule for:". And this would create a rule specifically for the host "checkmk_server" where the mount point is "/".
[0:07:55] So let's create that rule. This time I will leave the properties empty because they are optional; I will only set the value and the conditions. If you look at the conditions, you see that they are already pre-filled: the host is "checkmk_server" and the mount point is "/", and then you see this "$" sign. This "$" sign is there because the condition is a regular expression, and a "$" sign means end of string.
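To see what the "$" changes, here is a small Python sketch. It assumes that the condition is compared against the beginning of the mount point, which is the behaviour the video describes; the list of mount points and the helper name `matching` are invented for the example.

```python
import re

mount_points = ["/", "/boot", "/home", "/var/lib"]  # example mount points


def matching(pattern, items):
    """Return the items whose beginning matches the pattern.

    re.match anchors at the start of the string only, which mirrors the
    prefix behaviour described in the video.
    """
    return [item for item in items if re.match(pattern, item)]


print(matching("/", mount_points))   # without "$": every mount point matches
print(matching("/$", mount_points))  # with "$": only the root file system
```

Without the anchor the pattern is satisfied by anything that begins with a slash; with it, the string has to end right after the slash.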
[0:08:28] If we did not have this "$" sign here, this rule would be applied to every file system service on the "checkmk_server" host where the mount point starts with a "/". That is every mount point, so it would apply to every single file system service on this host. Now let's look at the value. You see that there's a lot more going on here; that's because the file system service is quite powerful and you can do a lot of things with it. But often you want to use this first item, the levels for the file system. With this we can set thresholds for the percentage of the storage capacity that's currently being used.
[0:09:18] You could also use it for remaining free space, but I will stick to used space for now. So I'm gonna set pretty low values so that we can see if the state of the service will go to WARN or Critical. So let's set it to 2 and 5 for example. Now let's save and as always activate the changes.
[0:09:51] Now let's head back to the services of the host, so the "checkmk_server", and then we reschedule the check to see if it updates. Okay, so now you see that this service is already Critical. And you can also see the thresholds of the service here: 2% and 5% are the Warning and Critical thresholds.
[0:10:24] Okay, so now we want to make a third rule, but this time for network interfaces, so network cards on servers, or switch ports and network connections on switches and routers. In Checkmk these are covered by the same check, called "Interface" followed by the name of the interface or the number of the port. I also want to create a rule that applies to multiple services, so let me show you how that works. Let's go to the switch we added in a previous episode, switch1.
[0:10:54] You see that there are quite a few interfaces already. Let's just pick one for example number 17. Once again we go to parameters for this service. And the rule set for this one is called Network interfaces and switch ports, so let's click on that.
[0:11:15] So this screen should now look familiar to you.
[0:11:19] But this time the button says "Create port specific rule for:". So this would create a rule for a service on host switch1 and on a specific port called 00017. So let's create it. And once again you see some information pre-filled in the conditions area: the switch1 host and the port are already pre-configured. If I were to uncheck the port specification, then this rule would apply to all interface services on the switch1 host. And if I also uncheck this one, then this rule would apply to all interface services of all hosts.
[0:12:14] Under values you once again see a lot of options but we will use 'Used bandwidth (minimum or maximum traffic)'. So let's check that box and here you can add multiple elements.
[0:12:29] So you can add an element and then configure whether it's for incoming or outgoing traffic, and for upper or lower values. By selecting lower you could, for example, configure the rule to warn you when there is no traffic on the interface at all, but I will set it for outgoing traffic and upper values. You could also set it for absolute levels in bits or bytes per second, but we'll stick to 'Percentual levels'.
[0:13:02] And I will set the warning at 10% and critical at 20%.
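A percentual bandwidth level can be pictured as traffic divided by the speed of the port. The sketch below is a simplified Python model, not Checkmk's implementation: the function name `bandwidth_state`, the inclusive comparisons, and the sample traffic figures are assumptions, and it takes 10% as the warning level and 20% as the critical level.

```python
def bandwidth_state(bits_per_second, link_speed_bps, warn_pct, crit_pct, upper=True):
    """Classify traffic as a percentage of link speed.

    upper=True:  alert when usage rises to or above the levels.
    upper=False: alert when usage falls to or below the levels.
    """
    used_pct = 100.0 * bits_per_second / link_speed_bps
    if upper:
        if used_pct >= crit_pct:
            return "CRIT"
        if used_pct >= warn_pct:
            return "WARN"
    else:
        if used_pct <= crit_pct:
            return "CRIT"
        if used_pct <= warn_pct:
            return "WARN"
    return "OK"


# 150 Mbit/s of outgoing traffic on a 1 Gbit/s port is 15% utilisation.
print(bandwidth_state(150e6, 1e9, warn_pct=10, crit_pct=20))
```

The `upper=False` branch corresponds to the "lower" option mentioned in the video, where an idle interface is the condition you want to be alerted about.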
[0:13:18] So let's save the rule and now activate the changes. But what happens when you create two rules for one service? Well, there is a very simple principle in Checkmk: the first rule whose conditions are met sets the parameters for the service, and all the rules coming after that are simply ignored. You can use this principle to create one general rule for all services of a specific type and then create one for an exception. After that you just need to make sure that the exception is placed before the general rule. Let me show you an example. Let's go back to the switch; this time we want to create a different rule for the interface on port 18. So once again go to parameters for the service.
[0:14:22] You see that there is already one rule in this rule set. Now I'm going to create a second one. Under conditions, I will leave port 18 as it is.
[0:14:38] The only thing I will change is the values, again under 'Used bandwidth'. This time I will once again pick outgoing traffic and upper values, but I will set the levels to 80% and 90%. Let's save this. As you can see, we now have two rules. The color at the beginning of each row indicates which rule is currently being applied. Green means that the conditions are met and the rule is applying its parameters. Yellow means that the conditions are met but the parameters have already been set by an earlier rule. That is not what we want, because we want our exception to be considered before the general rule. So we need to change the order, which we can do by simply dragging the row by this arrow icon. Let's move it before the general rule. Now you can see that our exception rule is green and applying its parameters, and our general rule is not being applied. So now we can activate the changes.
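The first-match principle and the effect of reordering can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how Checkmk stores or evaluates rules: the dictionary layout, the item names "18" and "5", and the helper names are all hypothetical.

```python
import re

# Hypothetical rule list, evaluated top to bottom. A condition of None
# means "no restriction". Levels are (warn, crit) in percent.
rules = [
    {"host": "switch1", "item": "18$", "levels": (80, 90)},  # exception
    {"host": "switch1", "item": None, "levels": (10, 20)},   # general rule
]


def rule_matches(rule, host, item):
    """Check the host and item conditions of a single rule."""
    if rule["host"] is not None and rule["host"] != host:
        return False
    if rule["item"] is not None and not re.match(rule["item"], item):
        return False
    return True


def effective_levels(rules, host, item):
    """Return the levels of the first rule whose conditions match."""
    for rule in rules:
        if rule_matches(rule, host, item):
            return rule["levels"]
    return None  # no rule matched; a default would apply


print(effective_levels(rules, "switch1", "18"))  # the exception is found first
print(effective_levels(rules, "switch1", "5"))   # falls through to the general rule
```

If the two entries were swapped, the general rule would match port 18 first and the exception would never be reached, which is the situation the yellow marker pointed out before the rows were reordered.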
[0:16:02] So now let's go back to the switch and look at another service. So let's pick for example this one on port 5. Again go to parameters for this service and let's go back to the rule-set. Now I want to show you a different way you can get to the rules. Not over a specific service but through the setup menu.
[0:16:30] So let's go to the setup menu and in the search box type in "rules". Now you see different categories of rules: host monitoring rules, hardware and software inventory rules, the service monitoring rules we've just been working with, service discovery rules, agent rules and SNMP rules. But we are looking for a host monitoring rule, namely the 'host check command'.
[0:17:01] We want to find a solution for the following case: let's say you want to check the status of a host but you're unable to ping that host. The normal check to test the availability of a host is a ping, so let's change that to a TCP connect on port 80. To do this we go to the host check command. You see that there is already a rule here about docker, but let's not get into that.
[0:17:28] Let's simply create a new rule. For now, we leave the conditions as they are. You could configure a host tag here to only apply this to specific hosts, but we'll get into host tags in the next part. For now, let's just change the smart ping to a TCP connect on port 80 and save that. Then we should be able to use the TCP connect to see whether a host is up or not. Generally speaking, you can say that in Checkmk everything which is configured for hosts and services is based on rules. And now you know how to configure this by yourself.
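The idea behind a TCP connect host check can be illustrated with Python's standard socket module. This is only a sketch of the concept, not the check Checkmk runs; the function name `tcp_host_is_up` and the two-second timeout are assumptions.

```python
import socket


def tcp_host_is_up(host, port=80, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Connection refused, timeouts and name resolution failures are all
    subclasses of OSError and are treated as "host not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tcp_host_is_up("192.0.2.10")` (a placeholder address) would report whether something accepts connections on port 80, which is useful for hosts that drop ICMP but run a web server.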
[0:18:13] In the next part, we're taking a look at host tags and how you can use them to create rules that would apply, for example, to all your Linux servers at once. That's it for part one. If this was helpful to you, please like the video and subscribe to the channel. See you in part two.
