Ep. 12: Acknowledging problems in Checkmk
Note: All the videos on our website offered in the German language have English subtitles and transcripts, as given below.
[0:00:00] | Welcome back to the Checkmk channel. Today we will be looking at problem acknowledgement. |
[0:00:16] | Acknowledging problems – what is this all about? So, what actually is a problem? Well, when a host or service is not okay, that is a problem. |
[0:00:22] | For example, a host that is DOWN, or a service that is on WARN, CRIT or UNKNOWN – this is what you call a problem. Now, Checkmk distinguishes between two types of problems. Namely, those that are not being handled, and those that are being handled. |
[0:00:35] | So, in principle, being handled means that the problem is known, and that someone is already taking care of it. Acknowledging a problem simply ensures that it is considered handled from then on. |
[0:00:45] | This then has three effects: The first is that you see an icon in the interface overview. The second is that there may be no further notifications about the problem. The third is that the problem disappears from some views: for example, in the 'Tactical Overview' it no longer appears under 'Unhandled', only under 'Handled Problems'. |
[0:01:05] | There are also other problem views from which these items drop out, so that in day-to-day operations you always have an overview of the new, currently unknown, or unresolved problems. |
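The distinction between handled and unhandled problems described above can be sketched in a few lines of Python. This is only an illustrative model, not Checkmk code: the `Item` class, its fields, and `split_problems` are hypothetical names, and the sketch considers acknowledgement as the only way a problem becomes handled.

```python
from dataclasses import dataclass

# States that count as a problem, as listed in the episode.
PROBLEM_STATES = {"DOWN", "WARN", "CRIT", "UNKNOWN"}


@dataclass
class Item:
    """Hypothetical stand-in for a host or service."""
    name: str
    state: str                  # e.g. "OK", "WARN", "CRIT", "UNKNOWN", "DOWN"
    acknowledged: bool = False  # set by the acknowledge command


def is_problem(item: Item) -> bool:
    """A host or service that is not okay is a problem."""
    return item.state in PROBLEM_STATES


def split_problems(items):
    """Return (unhandled, handled) problems; items that are OK are ignored."""
    unhandled = [i for i in items if is_problem(i) and not i.acknowledged]
    handled = [i for i in items if is_problem(i) and i.acknowledged]
    return unhandled, handled


items = [
    Item("Filesystem /", "CRIT", acknowledged=True),
    Item("CPU load", "WARN"),
    Item("Memory", "OK"),
]
unhandled, handled = split_problems(items)
print([i.name for i in unhandled])  # problems nobody has acknowledged yet
print([i.name for i in handled])    # acknowledged problems
```

In this model an acknowledged item is still a problem; it simply moves from the unhandled list to the handled one.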
[0:01:17] | I will now show you exactly how this works. Acknowledging problems works via commands. To execute a command for a host or service, I first have to select the host or service. For example, I go into the 'Tactical Overview' and click the 'Problems' field, and there I can see that I have 6 current problems. |
[0:01:37] | So I now select the first item, the filesystem 'root' in 'mycmkserver', and now I go to the top of the screen to this little hammer, which shows all of the available commands. The one I need happens to be the very first item as well. |
[0:01:51] | I enter a text into the comment field, "Disk is ordered" for example, and then just press 'Acknowledge' to confirm the whole procedure. Now you first see this new icon, which means that the problem has been acknowledged, and the second icon right next to it is the comment that I just entered. |
[0:02:16] | If you now reload the sidebar or just wait a bit, you will see that the number of 'Unhandled Problems' is now 1 fewer, because this problem is now considered to be "handled", and therefore it has been removed from the list. You can also acknowledge several problems at once. |
[0:02:33] | That is also very simple. For example, I could acknowledge the 5 unhandled problems you see here all together, since they belong together: they are all CPU load. If I now go to the hammer here in the list and execute a command, it will automatically apply to all items in the list. |
[0:02:55] | Here you can also see "Do you really want to acknowledge the problems on the following 5 services?" So it concerns these 5 services. If I now confirm here, I have acknowledged all 5 of them and, as you can see, they have been dropped from 'Unhandled Problems'. So now there are no 'Unhandled Problems'. Of course I still have my 6 problems, but these have all been acknowledged. |
[0:03:18] | Whatever you do, you must always be able to undo it, and in this case I might want to remove an acknowledgement. This is performed in the same way: I select the items whose acknowledgement I want to remove, for example all of them, go back and select the hammer, and now you will find this button on the right side, "Remove Acknowledgement". Here you don't need to enter a comment. Simply press 'Remove', click 'Yes', return to the view, and you will see that all of the acknowledgement symbols have now disappeared. |
[0:03:48] | So now the next question is, of course: Maybe I don't want to edit the whole list here at once, but rather only some of the items. |
[0:03:56] | There are basically two possibilities: The first one is to use filters – which you can find up here – to limit the selection, but you can also activate the checkboxes – with this little button here – then every line will get a checkbox, and now I can simply select which services I want to perform the command on, for example these 3, then go back to my hammer, and select 'Acknowledge'. |
[0:04:19] | I simply enter "Test" here, and you can now see that it only concerns the 3 services which are displayed here. Press 'Yes', and go back to the view. By the way, there are two links here. I can go back to the view with the checkboxes reset, or I can go back so that the checkboxes remain selected because I might want to perform another command on these objects. |
[0:04:47] | As you might have noticed, there are some interesting options in the panel for the acknowledgement command that I have not yet explained. I want to make up for that now, because there are some very interesting features. So I go back to the hammer, and we will have another look at this panel. |
[0:05:03] | For this I'll close the sidebar so that we can see the panel better. So, there are three checkboxes here, and there is also time information here. Maybe we should start with this. You can select 'Expire Acknowledgement after', and then specify days, hours and minutes. |
[0:05:17] | This means that such an acknowledgement now has an expiration time. For example, you could say that the problem should normally be resolved within one day. If you set this acknowledgement to a one day expiration time, and then if after one day the problem still exists, the acknowledgement disappears and the problem reappears as unhandled because possibly nobody has really taken care of it. |
[0:05:37] | However, this is a feature that is only available in the Enterprise Edition. Then there is also the checkbox 'sticky' – this is a little more subtle. It could occur, for example, that a problem is in the WARN state, and then later goes to the CRIT state. |
[0:05:56] | The question now is what happens with the acknowledgement. Take, for example, a file system that is in the WARN state because it is 80 percent full. If you want it to be regarded as unhandled again, as a new problem, when it later goes to CRIT, then this sticky checkbox should not be set. |
[0:06:17] | If this is set, it would mean that this problem would remain acknowledged until it is OK again. 'Send notification' is enabled by default, and simply means that a notification will be triggered. |
[0:06:29] | We will be making a separate video on the subject of notification, but with the normal setting this would simply mean that the people responsible would receive an email that this problem has been acknowledged. |
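The expiration time and the 'sticky' option described above can be summarised as a small decision function. The following Python sketch is a simplified model, not Checkmk's implementation: the `Ack` class and `ack_still_valid` are hypothetical names, and the sketch assumes that a non-sticky acknowledgement is dropped on any change of state, whereas the episode only mentions the change from WARN to CRIT.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Ack:
    """Hypothetical model of an acknowledgement and two of its options."""
    sticky: bool
    expires_at: Optional[datetime] = None  # None means no expiration time


def ack_still_valid(ack: Ack, old_state: str, new_state: str,
                    now: datetime) -> bool:
    """Decide whether the acknowledgement survives the latest check result."""
    if new_state == "OK":
        # The problem is gone, so there is nothing left to acknowledge.
        return False
    if ack.expires_at is not None and now >= ack.expires_at:
        # Expired: the problem shows up as unhandled again.
        return False
    if new_state != old_state and not ack.sticky:
        # Assumption: a non-sticky acknowledgement ends on any state change.
        return False
    return True


t0 = datetime(2024, 1, 1, 12, 0)
one_day = Ack(sticky=True, expires_at=t0 + timedelta(days=1))

# WARN -> CRIT: only the sticky acknowledgement survives.
print(ack_still_valid(Ack(sticky=True), "WARN", "CRIT", t0))
print(ack_still_valid(Ack(sticky=False), "WARN", "CRIT", t0))
# Still WARN two days later: the one-day acknowledgement has expired.
print(ack_still_valid(one_day, "WARN", "WARN", t0 + timedelta(days=2)))
```

The order of the checks reflects the transcript: a return to OK always ends the acknowledgement, expiration applies even to sticky acknowledgements, and the sticky flag only matters when the state changes between problem states.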
[0:06:39] | And the 'persistent comment': if I check this option, the comment attached to the service will remain even if the acknowledgement disappears again. Usually the two are linked, so that when the state goes back to OK, for example, the comment disappears together with the acknowledgement. |
[0:06:56] | But this way you can effectively pin the comment to it, and would have to manually remove the comment later when you don't need it anymore. |
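The effect of the 'persistent comment' option can be illustrated the same way. Again, this is a minimal sketch with hypothetical names (`Comment`, `Service`, `acknowledge`, `remove_acknowledgement`), not Checkmk code: when the acknowledgement goes away, ordinary comments go with it, while persistent ones stay until someone deletes them manually.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    text: str
    persistent: bool = False  # the 'persistent comment' checkbox


@dataclass
class Service:
    """Hypothetical stand-in for a monitored service."""
    name: str
    acknowledged: bool = False
    comments: List[Comment] = field(default_factory=list)


def acknowledge(service: Service, text: str, persistent: bool = False) -> None:
    """Mark the service as acknowledged and attach the comment."""
    service.acknowledged = True
    service.comments.append(Comment(text, persistent))


def remove_acknowledgement(service: Service) -> None:
    """Drop the acknowledgement; only persistent comments remain."""
    service.acknowledged = False
    service.comments = [c for c in service.comments if c.persistent]


svc = Service("Filesystem /")
acknowledge(svc, "Disk is ordered", persistent=True)
acknowledge(svc, "Test")
remove_acknowledgement(svc)
print([c.text for c in svc.comments])  # only the pinned comment is left
```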
[0:07:04] | So, that was everything on the subject of acknowledgements. I hope you enjoyed it and that we will meet again for the next episode on the topic of Downtimes. |