Ep. 48: Monitoring file servers with Checkmk
[0:00:00] | Hello, my name is Bastian Kuhn, and today I'm going to show you some options of the filesystem check. |
[0:00:16] | As an example, I have a normal file server. This file server has a lot of file systems, and what I'm going to show you here works for every check that has this name. It doesn't matter if it's a file server, a normal server, or an agent-based or SNMP-based device. |
[0:00:39] | The first thing I notice here is that the default levels of 80.00%/90.00% from Checkmk do not fit all the file systems. Some of them are green, others are red, or yellow, which means they are on warning. The first thing is to adjust these file system levels. |
[0:01:03] | And of course, I don't want to put that much effort into it. First, let's find out how to do it. The easiest way is to use the burger menu. This burger menu has the option Parameters for this service, and this leads you basically to every threshold rule for the corresponding service. On this page here, you can see the threshold rule is Filesystem (used space and growth). |
[0:01:32] | It's currently using the Default value and you even see here what the default values are. You now click on the name. And here I can create a rule. |
[0:01:46] | It's basically the same as for every other rule. I have my Rule Properties, which I don't need to set up, so in my case I hide them anyway. Then directly here I have the thresholds, and finally here the Conditions. If I create a rule without any condition, it's going to be a default rule for every file system. So, let's go with the levels. That's the simplest. |
[0:02:17] | One option here would be to just set a higher value. I save this, activate it, and check again. It takes a moment until the next check, so I can just force it here. And now I see my file systems are on green. The only problem I have here is that what a percentage threshold means always depends on the size of the file system. |
[0:03:23] | So, 95% of a small file system is something different than 95% of a really big file system. But we do have some options to handle this situation. If I go back to the rule, you see how the page has changed. Now that we have one rule, we have the overview again. We can directly change the rule here, but I'm gonna go with the overview. So, to prevent this situation, one option is to set the levels in a dynamic way with Dynamic levels. |
[0:04:06] | Then here we can add multiple elements. And for each element we can choose the file system size from which this level should apply. For example, for Filesystem larger than 1GiB we can use 90%, 95%. And for Filesystem larger than 100GiB, you can use 98%, 99%, and so on. But this is not really dynamic either. There's another option I want to show you. |
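The Dynamic levels idea described above can be sketched in a few lines of Python. This is only an illustration, not Checkmk's code: the function name `levels_for_size` is hypothetical, the size brackets are the ones mentioned in the video, and falling back to 80%/90% for file systems below the smallest bracket is an assumption made for this sketch.

```python
GIB = 1024 ** 3

# (lower size bound in bytes, (warn %, crit %)); values taken from the example in the video
DYNAMIC_LEVELS = [
    (1 * GIB, (90.0, 95.0)),
    (100 * GIB, (98.0, 99.0)),
]

# Assumption for this sketch: file systems below the smallest bracket use 80%/90%
FALLBACK_LEVELS = (80.0, 90.0)


def levels_for_size(size_bytes, dynamic_levels=DYNAMIC_LEVELS, fallback=FALLBACK_LEVELS):
    """Return the (warn, crit) pair of the largest size bracket the file system exceeds."""
    chosen = fallback
    # Walk the brackets from smallest to largest so the last match is the largest one
    for lower_bound, levels in sorted(dynamic_levels):
        if size_bytes > lower_bound:
            chosen = levels
    return chosen


print(levels_for_size(10 * GIB))
print(levels_for_size(500 * GIB))
```

Each bracket still has fixed levels, which is why this approach is only dynamic in steps.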
[0:04:49] | Let's not use the Dynamic levels. We just set them back to a default value and instead let Checkmk automatically adapt the file system levels based on the file system size. For that, we have the option Magic factor. Depending on the Magic factor, these file system levels will be adapted up or down. |
[0:05:14] | So, basically, if you have a big file system, the levels will grow. If you have a small file system, the levels will go lower. One problem we have with that is, of course, that on really small file systems we're suddenly gonna have, like, errors. I can show you what I mean. Let's go to 0.6, the smallest possible value. |
[0:05:47] | You can see the possible values here, and we just add a value here. Then I'm gonna save it and activate it. And we check the host again. I trigger the check again so I don't have to wait. We can see that some of the smaller file systems went back into a problem state. |
[0:06:32] | That's because the new magic levels, well, not magic, but based on the Magic factor, are now smaller than the 95%/98%. And you see how the levels adapted based on the file system size everywhere. One possibility to address that is to set the minimum levels. So, let's go back to the rule again. |
[0:07:03] | I'm gonna edit it and just set Minimum levels if using Magic factor. Maybe it's a good way to set 95%, 98%, but this always depends on your environment. And I save it again and activate it. |
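To make the effect of the Magic factor and the minimum levels more concrete, here is a rough Python sketch. The scaling formula and the 20 GB reference size reflect my understanding of how Checkmk scales the levels and should be treated as assumptions, as should the function name `magic_levels` and the default minimum of 50%/60%. The base levels of 80%/90% in the example are chosen for illustration only.

```python
def magic_levels(size_gb, warn, crit, magic, normsize_gb=20.0, minimum=(50.0, 60.0)):
    """Scale percentage levels by file system size (sketch, not the official implementation).

    A file system of exactly normsize_gb keeps its levels. Larger file systems
    get higher levels and smaller ones get lower levels, never below `minimum`.
    """
    relative_size = size_gb / normsize_gb   # size relative to the reference size
    felt_size = relative_size ** magic      # dampened size, assuming 0 < magic <= 1
    scale = felt_size / relative_size       # < 1 for big, > 1 for small file systems
    warn_scaled = 100.0 - (100.0 - warn) * scale
    crit_scaled = 100.0 - (100.0 - crit) * scale
    # Minimum levels keep small file systems from getting unreasonably low thresholds
    return max(warn_scaled, minimum[0]), max(crit_scaled, minimum[1])


for size_gb in (1, 20, 2000):
    print(size_gb, magic_levels(size_gb, 80.0, 90.0, magic=0.6, minimum=(95.0, 98.0)))
```

With these inputs the 1 GB file system would drop far below its base levels without the minimum, so the minimum of 95%/98% takes over, while the 2000 GB file system gets levels well above 95%.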
[0:07:35] | Then I check again. And now everything should be fine. But, of course, we have some more ways to apply different thresholds to different services. We can, for example, also use the file system mount points, like oracle here. Let's say we want to do that, so I go back to the parameters and I'm gonna create a second rule. |
[0:08:22] | This time, I can, for example, go with a different level. Let's say we want just 90%, 92%, and this time only for everything with /oracle. After saving this rule, we can see a special behavior. We are currently in the context of fileserver and the mount point /oracle. |
[0:08:52] | And these little indicators here show that this rule is matching. This is also a matching rule, but all of its parameters are overridden by a previous rule. What happens here is this: if a rule has multiple parameters, you can override them by putting another rule before it. So, for example, if we move this rule up, we override the Levels for filesystem. |
[0:09:27] | That applies to every mount point beginning with /oracle. But we still get all the other options, like the Magic factor and the Minimum levels. This way, you are able to set, for example, one default rule defining, let's say, the Magic factor and maybe new default levels. But you are still able to override these levels again for every type of mount point. And the next thing here is that the condition is Mount point begins with /oracle. |
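The per-parameter behavior described above (the first matching rule that sets a parameter wins, while parameters it does not set still come from later rules) can be sketched like this. The dictionary keys and the function name `effective_parameters` are hypothetical and only illustrate the merge order; they are not Checkmk's internal names.

```python
def effective_parameters(matching_rules):
    """Merge the values of all matching rules, top to bottom.

    For each parameter, the first rule in the list that sets it wins.
    """
    merged = {}
    for rule in matching_rules:
        for key, value in rule.items():
            # setdefault keeps a value that an earlier (higher) rule already set
            merged.setdefault(key, value)
    return merged


# Hypothetical parameter names, values taken from the video
oracle_rule = {"levels": (90.0, 92.0)}
default_rule = {"levels": (95.0, 98.0), "magic_factor": 0.6, "minimum_levels": (95.0, 98.0)}

# /oracle rule moved above the default rule: its levels win, the rest is inherited
print(effective_parameters([oracle_rule, default_rule]))

# Default rule on top: the /oracle levels are completely overridden
print(effective_parameters([default_rule, oracle_rule]))
```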
[0:10:09] | Let's say you want to have a condition only for the / file system. For that, you go to Add rule and, just for the example, again use the levels. I enter / as the Mount Point, but this would match everything that starts with /. So, I'm just gonna add a $, since we can use regular expressions here. I save it. |
[0:10:39] | And now I have a rule matching only /. But, again, this rule needs to be moved up above the rule without conditions, because otherwise the rule without conditions will always match first and override the levels. So, up with this rule. But there are even more options you can set thresholds on. |
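Why the trailing $ matters can be shown with Python's `re` module. This sketch assumes that the mount point condition is matched as a regular expression anchored at the beginning of the name, as the "begins with" wording suggests; the helper `mount_point_matches` is hypothetical.

```python
import re


def mount_point_matches(pattern, mount_point):
    """Return True if the pattern matches at the beginning of the mount point."""
    # re.match only anchors at the start, which models a "begins with" condition
    return re.match(pattern, mount_point) is not None


for pattern in ("/", "/$", "/oracle"):
    for mount_point in ("/", "/oracle/data", "/var"):
        print(repr(pattern), repr(mount_point), mount_point_matches(pattern, mount_point))
```

The plain `/` matches every absolute mount point, while `/$` only matches the root file system.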
[0:11:03] | I'm going to show you in a moment. I create another rule. The next thing you can do is set levels based on a time range. So, I can say I want to have a time range of 24 hours. And for this range we can set levels on the trend, in MiB or in percentage growth. |
[0:11:30] | And the same for decreasing. Another really useful option is to set levels on the time left until the file system is full. Based on the trend again, you can get a warning if it only takes 12 hours until the file system is full, or an error if it takes 6 hours. |
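The arithmetic behind "time left until full" is simple enough to sketch. The following Python example assumes a linear trend over the time range and uses hypothetical function names; the 12 and 6 hour thresholds are the ones from the video, and treating a value exactly on the threshold as a problem is an assumption of this sketch.

```python
def hours_until_full(free_mib, growth_mib, range_hours=24.0):
    """Estimate the hours until the file system is full from a linear trend.

    growth_mib is the growth observed over range_hours. Returns None if the
    file system is not growing.
    """
    if growth_mib <= 0:
        return None
    growth_per_hour = growth_mib / range_hours
    return free_mib / growth_per_hour


def state_for_time_left(hours_left, warn_hours=12.0, crit_hours=6.0):
    """Map the estimated time left to a monitoring state."""
    if hours_left is None:
        return "OK"
    if hours_left <= crit_hours:
        return "CRIT"
    if hours_left <= warn_hours:
        return "WARN"
    return "OK"


# 1000 MiB free, growing by 2400 MiB per 24 hours, i.e. 100 MiB per hour
hours = hours_until_full(free_mib=1000, growth_mib=2400)
print(hours, state_for_time_left(hours))
```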
[0:11:57] | These were just some of the options you have with Checkmk. I really hope it was helpful. Don't forget to subscribe, and see you next time. |