Ep. 7 (part 1): Working with rules and setting thresholds in Checkmk
|[0:00:00]||Welcome back to the Checkmk channel. In this episode, we are covering a slightly more complex topic, namely how you can parameterize your hosts and services, for example by setting thresholds, and how you can do this with one of the features that distinguishes Checkmk from other products: the rule-based configuration.|
|[0:00:22]||Because this episode is a bit more extensive, we will split it into three parts. In this first part, I will show you how to work with rules and how to set thresholds. In the second part, we will take a look at host tags and how you can use those to create more complex rules. And in the third part, we will look at folders and how you can use them to group hosts together and create a more elegant and clearly structured setup.|
|[0:01:01]||In this episode, we'll take a look at how to set up parameters for services.|
|[0:01:05]||What are these parameters? A parameter is, for example, the threshold that we use to indicate when a service should go to the Warning or the Critical state. Checkmk has many of these thresholds pre-configured for you, but often you want to modify them to better fit your installation or your environment.|
|[0:01:27]||Take for example the file system check, where sometimes you might want the service to go to 'Warning' when 80% of the available storage space is used, and in other cases you might want to set that threshold to 95%.|
|[0:01:39]||But there are more parameters, for example how often Checkmk should check the state of a certain service, or in which time window you want notifications to be sent out. But first, let's take a look at how to set up thresholds, that is, when a service should go to Warning or Critical. To do this, I will use the CPU load as an example. Let's create a rule specific to this host.|
|[0:02:05]||Every rule has this general section with a description and a comment. These fields are all optional, so you don't need to fill them in. I will give my rule a name, "CPU load of monitoring server".|
|[0:02:26]||Now next up every rule has a value and conditions. Let's start with the value. So these are the actual thresholds when a service should go to Warning or Critical.|
|[0:02:38]||In this case it's still the default value, so 5 and 10. But let's change that. I want a warning when there is one process running per core, and it should go to critical when there are three processes per core.|
|[0:02:57]||Next up, Condition. At first glance, this might seem a bit complex, but it's easier than it looks. Right now Checkmk has pre-filled the explicit host, because we created a rule explicitly for this host. I'm not going to change that, I'm just going to save it. And of course activate the change.|
|[0:03:30]||Now let's see if it worked. Let's go back to all hosts, to our monitoring server, and once again we'll open the parameters for the service.|
|[0:03:44]||You see here that rule 1 in the main directory is applied to this service and here you see the values that we set. But how do we know that what we configured is working? To do that I will just do a simple trick on a command line. I will create an endless loop: while true; do true; done &.|
|[0:04:09]||This will make sure that one process is running continuously, and this will increase the load by exactly one. The load is the number of processes that are currently active on a core. If I repeat this and look at top, you'll see that there are now quite a few of these shell processes running and the load average is slowly rising. It will take some time to rise. This first number is the load average of the last minute.|
|[0:04:51]||The second number is the average of the last five minutes and the last number is the average of the last 15 minutes. If I now go back to the UI you see the service is still okay.|
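The three numbers described here can also be read by a script. The following is a minimal sketch in Python, assuming a Linux-style `/proc/loadavg` line. The names `LoadAverage` and `parse_loadavg` and the sample line are made up for illustration and are not part of Checkmk.

```python
from typing import NamedTuple


class LoadAverage(NamedTuple):
    """The three load averages in the order that top shows them."""
    one_minute: float
    five_minutes: float
    fifteen_minutes: float


def parse_loadavg(line: str) -> LoadAverage:
    """Parse the first three fields of a /proc/loadavg style line."""
    fields = line.split()
    return LoadAverage(float(fields[0]), float(fields[1]), float(fields[2]))


# Hypothetical sample line; on a Linux host you could read /proc/loadavg instead.
sample = "2.31 1.12 0.48 3/412 12345"
load = parse_loadavg(sample)
print("1 min:", load.one_minute)
print("5 min:", load.five_minutes)
print("15 min:", load.fifteen_minutes)
```

Because the one-minute value reacts fastest, it is the first of the three to rise after the loops are started.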
|[0:05:05]||That's because of the check interval, but we can reschedule the Checkmk service, and then our service should go to 'Warning'. And now you see that the state goes to WARN. If I click on the name of the service, we go to the details, and you can see here in the graph that the CPU load has started rising.|
|[0:05:28]||You can also see the thresholds. The yellow line indicates the warning threshold and the red line the critical threshold. Here you see two and six, but we configured one and three. Why is that? Because we configure the thresholds per core and this VM has two cores, so all the values are multiplied by 2.|
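The per-core arithmetic can be written down as a short sketch. This is an illustration in Python, not Checkmk's actual implementation. The function names are hypothetical, and the sketch assumes that a level counts as reached when the value is greater than or equal to it.

```python
from typing import Tuple


def absolute_levels(warn_per_core: float, crit_per_core: float, cores: int) -> Tuple[float, float]:
    """Scale per-core levels to the absolute levels drawn in the graph."""
    return warn_per_core * cores, crit_per_core * cores


def load_state(load: float, warn: float, crit: float) -> str:
    """Return the state for a load value (assumption: a level is reached at >=)."""
    if load >= crit:
        return "CRIT"
    if load >= warn:
        return "WARN"
    return "OK"


# Values from the episode: 1 and 3 per core on a VM with two cores.
warn, crit = absolute_levels(1, 3, cores=2)
print(warn, crit)                    # 2 6
print(load_state(2.4, warn, crit))   # WARN
```

The load value 2.4 is only an example. It lies between the scaled levels 2 and 6, so the sketch reports WARN.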
|[0:05:49]||Let's reduce the CPU load again. So we are stopping the processes that we created.|
|[0:05:56]||We simply do that by using "fg" to get the process to the foreground and now we just stop them by pressing Ctrl+C. And let's repeat that.|
|[0:06:19]||Okay now let's check "top", just to make sure that nothing is running.|
|[0:06:23]||Okay, all the processes have been stopped now and the CPU load should slowly decrease. Now we're going to create a second rule, but this time for a file system. The difference is that there is only one CPU load service on a host, but for a file system there can be multiple.|
|[0:06:43]||And that's why when you create a rule you need to specify exactly for which file system you are creating a rule. For this example, we'll be creating a rule for the root file system of our monitoring server. So once again let's go to the services of the Checkmk server host.|
|[0:07:02]||This time we'll take a look at the file system service so that's the root file system of this server. Once again we go to parameters for the service. Here you can already see that this rule-set is a bit more complicated than the last one.|
|[0:07:22]||The rule set is called: Filesystems (used space and growth), and this is where we set the parameters for the file system. So let's click on it. Now we get to a very similar view as before, the only change is that the button now says "Create mount point specific rule for:". And this would create a rule specifically for the host "checkmk_server" where the mount point is "/".|
|[0:07:55]||So let's create that rule. This time I will leave the properties empty, because they are optional. I will only set the value and the conditions. If you look at the conditions, you see that they are already pre-filled, so the host is "checkmk_server" and the mount point is "/", and then you see this "$" sign. This "$" sign is there because the field uses a regular expression, and a "$" sign means end of string.|
|[0:08:28]||If we did not have this "$" sign here, this rule would be applied to every file system service on the "checkmk_server" host where the mount point starts with a "/". That is every mount point, so it would apply to every single file system service on this host. Now let's look at the value. You see that there's a lot more going on here. That's because the file system service is quite powerful and you can do a lot of things with it. But often you want to use this first item, levels for file system. With this we can set thresholds for the percentage of the storage capacity that's currently being used.|
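The effect of the "$" sign can be tried out with a few lines of Python. This is a minimal sketch that assumes the condition is matched from the start of the mount point, as described above. Python's `re.match` behaves the same way. The helper `item_matches` and the list of mount points are hypothetical.

```python
import re


def item_matches(pattern: str, item: str) -> bool:
    """True if the pattern matches from the beginning of the item.

    re.match anchors only at the start, so without "$" a pattern
    behaves like a prefix.
    """
    return re.match(pattern, item) is not None


# Hypothetical mount points on one host.
mount_points = ["/", "/boot", "/var/lib"]

print([m for m in mount_points if item_matches("/$", m)])  # ['/']
print([m for m in mount_points if item_matches("/", m)])   # ['/', '/boot', '/var/lib']
```

With "/$" only the root file system is selected. With "/" every mount point is selected, because each one starts with a slash.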
|[0:09:18]||You could also use it for remaining free space, but I will stick to used space for now. So I'm gonna set pretty low values so that we can see if the state of the service will go to WARN or Critical. So let's set it to 2 and 5 for example. Now let's save and as always activate the changes.|
|[0:09:51]||Now let's head back to the services of the host, so the "checkmk_server", and then we reschedule the check to see if it updates. Okay, so now you see that this service is already Critical. And you can also see the thresholds of the service here. So 2% and 5% are the Warning and Critical thresholds.|
|[0:10:24]||Okay, so now we want to make a third rule, but this time for network interfaces, so network cards on servers, or switch ports and network connections on switches and routers. In Checkmk these are covered by the same check, called interface, followed by the name of the interface or the number of the port. I also want to create a rule that applies to multiple services, so let me show you how that works. Let's go to the switch we added in a previous episode, switch1.|
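Why the service turns Critical right away becomes clear when the used percentage is compared with such low levels. The following Python sketch is an illustration only. The function name, the disk sizes and the 80/90 comparison levels are assumptions, and the real check offers many more options.

```python
from typing import Tuple

GIB = 2 ** 30  # bytes in one gibibyte


def filesystem_state(total_bytes: int, used_bytes: int,
                     warn_used_pct: float, crit_used_pct: float) -> Tuple[str, float]:
    """Return (state, used percentage) for levels on used space.

    Assumption: a level counts as reached when used space is >= the level.
    """
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct >= crit_used_pct:
        return "CRIT", used_pct
    if used_pct >= warn_used_pct:
        return "WARN", used_pct
    return "OK", used_pct


# Hypothetical root file system: 6 GiB used out of 50 GiB.
print(filesystem_state(50 * GIB, 6 * GIB, warn_used_pct=2, crit_used_pct=5)[0])    # CRIT
print(filesystem_state(50 * GIB, 6 * GIB, warn_used_pct=80, crit_used_pct=90)[0])  # OK
```

Any file system that is more than 5% full is Critical with the levels from the demonstration, which is why they are useful only for testing the rule.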
|[0:10:54]||You see that there are quite a few interfaces already. Let's just pick one for example number 17. Once again we go to parameters for this service. And the rule set for this one is called Network interfaces and switch ports, so let's click on that.|
|[0:11:15]||So this screen should now look familiar to you.|
|[0:11:19]||But this time the button says "Create port specific rule for:". So this would create a rule for a service on host switch1 and on a specific port called 00017. So let's create it. And once again you see some information pre-filled in the conditions area, so you see here the switch1 host, and the port is already pre-configured. If I were to uncheck the port specification, then this rule would apply to all interface services on the switch1 host. And if I also uncheck this one, then this rule would apply to all interface services of all hosts.|
|[0:12:14]||Under values you once again see a lot of options but we will use 'Used bandwidth (minimum or maximum traffic)'. So let's check that box and here you can add multiple elements.|
|[0:12:29]||So you can add an element and then configure whether it's for incoming or outgoing traffic, and for upper or lower values. By selecting lower you could, for example, configure the rule to warn you when there is no traffic on the interface at all, but I will set it for outgoing traffic and upper values. You could also set absolute levels in bits or bytes per second, but we'll stick to 'Percentual levels'.|
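The difference between upper and lower levels, and between percentual and absolute levels, can be sketched in Python. This is a simplified illustration, not the logic of the real interface check. The function names, the link speed and the traffic figures are assumptions.

```python
def utilization_percent(bits_per_second: float, link_speed_bps: float) -> float:
    """Express traffic as a percentage of the interface speed."""
    return 100.0 * bits_per_second / link_speed_bps


def check_levels(value: float, warn: float, crit: float, direction: str = "upper") -> str:
    """Compare a value with upper or lower levels.

    Assumption: upper levels are reached at >=, lower levels are
    undercut at <.
    """
    if direction == "upper":
        if value >= crit:
            return "CRIT"
        if value >= warn:
            return "WARN"
    elif direction == "lower":
        if value < crit:
            return "CRIT"
        if value < warn:
            return "WARN"
    else:
        raise ValueError("direction must be 'upper' or 'lower'")
    return "OK"


# Hypothetical port: 150 Mbit/s outgoing on a 1 Gbit/s link, illustrative levels.
out_pct = utilization_percent(150e6, 1e9)
print(check_levels(out_pct, warn=10, crit=20, direction="upper"))  # WARN

# Lower levels can flag an interface that carries no traffic at all.
print(check_levels(0.0, warn=1, crit=0.1, direction="lower"))      # CRIT
```

Percentual levels have the advantage that one rule fits interfaces of different speeds, whereas absolute levels in bits per second mean something different on a 100 Mbit/s port than on a 10 Gbit/s port.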
|[0:13:02]||And I will set the warning at 10% and critical at 20%.|
|[0:13:18]||So let's save the rule and now activate the changes. But what happens when you create two rules for one service? Well, there is a very simple principle in Checkmk: the first rule whose conditions are met sets the parameters of the service, and all the rules coming after that are simply ignored. You can use this principle to create one general rule for all services of a specific type and then create one for an exception. After that you just need to make sure that the exception is placed before the general rule. Let me show you an example. So let's go back to the switch. This time we want to create a different rule for the interface with port 18. So once again go to the parameters for the service.|
|[0:14:22]||You see that there is already one rule in this rule set. Now let's create a second one. Under conditions, I will leave port 18.|
|[0:14:38]||The only thing I will change is that I will set different values, again under 'Used bandwidth'. This time I will once again choose outgoing and upper values, but I will set the values to 80% and 90%. Let's save this. As you can see, we now have two rules. The color at the beginning of each row indicates which rule is currently being applied. The green color means that the conditions are met and that the rule is applying parameters. The yellow color means that the conditions are met but the parameters are already applied by an earlier rule. This is not what we want, because we want our exception to be considered before the general rule. So what we need to do is change the order, and we can do that by simply dragging the row by this arrow icon. So let's move it before the general rule. Now you can see that our exception rule is green and is applying the parameters, and our general rule is not being applied. So now we can activate the changes.|
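The first-match principle and the effect of the rule order can be modelled in a few lines. This Python sketch is a deliberately simplified model that only knows a host name and an item pattern, whereas real Checkmk rules can also depend on folders and host tags. The `Rule` class, the item names "18" and "5" and the levels of the general rule are assumptions made for the illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    description: str
    host: Optional[str]            # None means "any host"
    item_pattern: Optional[str]    # None means "any item"
    levels: Tuple[float, float]    # (warn, crit) in percent


def rule_matches(rule: Rule, host: str, item: str) -> bool:
    """A rule applies only if all of its conditions are met."""
    if rule.host is not None and rule.host != host:
        return False
    if rule.item_pattern is not None and re.match(rule.item_pattern, item) is None:
        return False
    return True


def effective_levels(rules: List[Rule], host: str, item: str) -> Optional[Tuple[float, float]]:
    """Return the levels of the first matching rule; later rules are ignored."""
    for rule in rules:
        if rule_matches(rule, host, item):
            return rule.levels
    return None


general = Rule("all ports of switch1", "switch1", None, (10.0, 20.0))
exception = Rule("port 18 only", "switch1", "18$", (80.0, 90.0))

# Exception placed after the general rule: it never takes effect.
print(effective_levels([general, exception], "switch1", "18"))  # (10.0, 20.0)
# Exception placed before the general rule: it wins for port 18.
print(effective_levels([exception, general], "switch1", "18"))  # (80.0, 90.0)
# Other ports still fall through to the general rule.
print(effective_levels([exception, general], "switch1", "5"))   # (10.0, 20.0)
```

The first call corresponds to the yellow marker from the demonstration: the exception matches, but an earlier rule has already supplied the parameters.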
|[0:16:02]||So now let's go back to the switch and look at another service. So let's pick for example this one on port 5. Again go to parameters for this service and let's go back to the rule-set. Now I want to show you a different way you can get to the rules. Not over a specific service but through the setup menu.|
|[0:16:30]||So let's go to the setup menu and in the search box type in "rules". Now you see different categories of rules, so you have host monitoring rules, hardware and software inventory rules, the service monitoring rules we've just been working with, service discovery rules, agent rules and SNMP rules. But we are looking for a host monitoring rule, namely the 'host check command'.|
|[0:17:01]||We want to find a solution for the following case: let's say you want to check the status of a host but you're unable to ping that host. The normal check to test the availability of a host is a ping. So now let's change that to a TCP connect on port 80. To do this we go to the host check command. You see that there is already a rule here about Docker, but let's not get into that.|
|[0:17:28]||Let's simply create a new rule. For now, we leave the conditions as is, you could configure a host tag here to only apply this on a specific host but we'll get into host tags in the next part. For now, let's just change the smart ping to a TCP connect on port 80 and let's save that. And then we should be able to use the TCP connect to see whether a host is up or not. Generally speaking, you can say that in Checkmk everything which is configured for hosts and services is based on rules. And now you know how to configure this by yourself.|
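What a TCP connect test does can be shown with Python's standard `socket` module. This is only a sketch of the idea and not how Checkmk implements its host check. The function name `tcp_host_is_up` is hypothetical, and the demonstration uses a throwaway listener on the local machine instead of a real host on port 80.

```python
import socket


def tcp_host_is_up(address: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Treat the host as up if a TCP connection can be established."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and unreachable networks all end up here.
        return False


# Demonstration against a temporary local listener on a free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

print(tcp_host_is_up("127.0.0.1", demo_port))  # a listener is present
listener.close()
```

Unlike a ping, this test succeeds only if something is listening on the chosen port, so it fits hosts that block ICMP but offer a service such as a web server.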
|[0:18:13]||In the next part, we're taking a look at host tags and how you can use them to create rules that would apply for example to all your Linux servers at once. That's it for part one, if this was helpful to you please like the video, subscribe to the channel, see you in part two.|
Ep. 1: Installing Checkmk 2.0 and monitoring your first host
In this video, Baris explains how to get started with Checkmk and start monitoring your first host within a few minutes.
Ep. 2: The Checkmk 2.0 user interface
In this video, Baris takes you through the new user interface in Checkmk 2.0. He explains the various components of the user interface such as the new navigation menus, the sidebar, main dashboard, tactical overview, how to switch between the Checkmk interface themes and much more.
Ep. 3: Using SNMP to monitor network devices in Checkmk 2.0
In this episode, Baris explains how to monitor network devices with Checkmk. SNMP is a protocol that many switches, routers, printers, UPSs, hardware sensors and other devices have implemented with the purpose of being able to monitor them easily.
Ep. 4: Monitoring Windows in Checkmk
In this video of our Getting started with Checkmk series, Baris explains how to install a Checkmk agent on a Windows host system and add that into your monitoring environment.
Ep. 5: Using metrics and graphs in Checkmk 2.0
In the 5th episode of the Getting started with Checkmk series, Baris explains using various metrics that you can monitor in Checkmk such as CPU utilization, CPU load etc. You can also see graph visualizations for these metrics or create and customize your own as per your requirements.
Ep. 6: Updating Checkmk 2.0 and using multiple instances
In this video, Baris explains how to update your Checkmk instance. It is very easy and can be done within minutes. You can run multiple Checkmk instances with different versions on the same system. This gives you the flexibility to test the new version before using it in production.
Ep. 7 (part 2): Smart rules with Host Tags in Checkmk
In the second part of this video, Baris explains using Smart rules with host tags in Checkmk. In the first part, he shows you how you can work with rules and set threshold values. These are features that you can use to build your rules even more intelligently and to better organize your monitoring.
Ep. 7 (part 3): Managing Hosts in Folder in Checkmk
In this final part of our episode on Rule-based monitoring in Checkmk, Baris demonstrates how to manage hosts in folders in Checkmk. This helps you to apply your monitoring configurations at scale and organize your hosts according to your needs.
Ep. 8: Working with Host and Service Groups in Checkmk
In this episode, Baris demonstrates how to create host and service groups in Checkmk, so you can perform actions on an entire group instead of configuring each of them individually.
Ep. 9: Using the Quicksearch function in Checkmk
In this episode of the Checkmk tutorials, Baris shows how you can use the Quicksearch function in Checkmk. You can use it to easily find and manage certain hosts or services. He also explains some examples of filters. In Checkmk 2.0 you can use the same syntax in the Search function found in the monitor menu to get identical results.
Ep. 10: Detecting configuration errors with the Analyze Configuration feature
With the Analyze Configuration feature, you can check if there are any configuration errors in your installation. Checkmk checks for a number of possible security risks or potential performance restrictions and indicates if there are any problems.
Ep. 11: View creation and customization in Checkmk
In this video, Baris demonstrates how to customize headers, columns, and more in Views in Checkmk for yourself or other users. He also explains how to create custom views and add desired information to these views.
Ep. 12: Acknowledging problems in Checkmk
In this video, Baris explains how you can acknowledge problems in Checkmk. This function helps you to qualify the states of hosts and services. This allows you to keep track of messages in the main dashboard and, for example, you can add comments to problems.
Ep. 13: Scheduling downtimes in Checkmk
In this episode of our Getting started with Checkmk series, Baris explains how you can manage the maintenance times of your systems in Checkmk. Such scheduled downtimes prevent your monitoring from sending false alarms when a host or service goes to WARN or CRIT during maintenance work. You can also inform the users concerned about the maintenance via Checkmk.
Ep. 14: Distributed monitoring with Checkmk
In this video, Baris explains how you can connect several Checkmk instances to a monitoring system and then manage it.
Ep. 15: MKPs and Plugins in Checkmk
In the 15th episode of our Getting started with Checkmk tutorial series, Baris explains what Checkmk Extension Packages (MKPs) are and how easy it is to integrate them into your Checkmk monitoring environment. MKPs are the preferred format when you make your own extensions, as they are easy to share with other users or deploy in distributed environments.
Ep. 16: Working with 'Bulk Actions' in Checkmk
In this episode of our Checkmk tutorials series, Baris explains how you can save a lot of time with bulk actions. With this feature you can perform various tasks such as deleting, renaming, service discovery etc. on a large number of hosts simultaneously.
Ep. 17: Working with network topologies in Checkmk
In this video of our Getting started with Checkmk series, Baris explains how to map network topologies in Checkmk. This feature is quite helpful to manage your network and prevent any unnecessary notifications from the devices in your network.
Ep. 18: Creating and customizing dashboards in Checkmk
In this video of our Getting started with Checkmk series, Mathias explains how you can create and customize dashboards in Checkmk 2.0, so you can get insights into your monitoring according to your requirements. Find out more in this video.
Ep. 19: Monitoring websites and their certificates with Checkmk
In this episode, Bastian demonstrates how to monitor a website and its certificate with Checkmk. You can also monitor specific web pages with Checkmk by using the several options that will suit your use case. Learn more in this video.
Ep. 20: Configuring dashboard elements in Checkmk
Learn how to add data visualization elements of the various metrics into your Checkmk Dashboard. In this video, Mathias explains how you can configure these elements and create a dashboard as per your requirements.
Ep. 21: Setting up notifications in Checkmk
Learn how to set up notifications in Checkmk and assign relevant contacts and contact groups to be notified for various events. Later in this video, our presenter Bastian also demonstrates how you can set up rule-based notifications according to different conditions for hosts and services.
Ep. 22: Monitoring logfiles with Checkmk
Monitor your logfiles with Checkmk using its Logwatch plugin. It is very useful when you want to monitor your logfiles regardless of whether you are using a UNIX/Linux or a Windows-based system. Learn more in this video.
Ep. 24: 3 Rules for efficient network monitoring
In this video, Bastian demonstrates 3 rules that will help you to efficiently monitor your network interfaces. With Checkmk 2.0, with just three rules, you can set up an efficient network monitoring that will not only monitor all of your network interfaces but also simultaneously provide a detailed overview of all of your ports.
Ep. 25: New UX and security improvements in Checkmk 2.1
Checkmk 2.1 comes with many UX improvements such as pre-built dashboards for Linux and Windows, faster core performance and much more. Security features such as two-factor authentication were also added in this new version. Watch this video to learn how to use these new features and enhancements in Checkmk.
Ep. 28: Working with InfluxDB integration in Checkmk
Learn how to send data to InfluxDB from Checkmk. As InfluxDB introduced a new protocol to send data to it, a new connector was developed with Checkmk to talk natively with it. Learn more about it in this video.
Ep. 29: New agent architecture in Checkmk 2.1
With Checkmk 2.1, the agent architecture was modified to enable performance improvements and add new features such as TLS encryption, data compression, and the reversal of direction of communication from the agent. This will enable push mode and pull mode.