Ep. 12: Acknowledging problems in Checkmk
Note: All the videos on our website offered in the German language have English subtitles and transcripts, as given below.
|[0:00:00]||Welcome back to the Checkmk channel. Today we will be looking at problem acknowledgement.|
|[0:00:16]||Acknowledging problems – what is this all about? So, what actually is a problem? Well, when a host or service is not okay, that is a problem.|
|[0:00:22]||For example, a host that is DOWN, or a service that is on WARN, CRIT or UNKNOWN – this is what you call a problem. Now, Checkmk distinguishes between two types of problems. Namely, those that are not being handled, and those that are being handled.|
|[0:00:35]||So, in principle, being handled means that the problem is known, and that someone is already taking care of it. Acknowledging a problem simply ensures that it is considered handled from then on.|
|[0:00:45]||This then has three effects: The first is that you see an icon in the interface overview. The second is that there may be no further notifications about the problem. The third is that the problem disappears from some views: for example, in the 'Tactical Overview' it no longer appears under 'Unhandled', only in the 'Handled Problems'.|
|[0:01:05]||And there are also other problem views from which these items drop out, so that in day-to-day operation you always have an overview of the new, not yet known, or unresolved problems.|
|[0:01:17]||I will now show you exactly how this works. Acknowledging problems works via commands. To execute a command for a host or service, first I have to select the host or service. I then go for example into the 'Tactical Overview', and through the field 'Problems', and there I can see that I have 6 current problems.|
|[0:01:37]||So I now select the first item, the filesystem 'root' in 'mycmkserver', and now I go to the top of the screen to this little hammer, which then shows all of the available commands. And what I need is also the very first item.|
|[0:01:51]||I enter a text into the comment field, "Disk is ordered" for example, and press 'Acknowledge' to confirm the whole procedure. Now you first see this new icon, which means that the problem has been acknowledged, and the second icon right next to it is the comment that I just entered.|
|[0:02:16]||If you now reload the sidebar or just wait a bit, you will see that the number of 'Unhandled Problems' is now 1 fewer, because this problem is now considered to be "handled", and therefore it has been removed from the list. You can also acknowledge several problems at once.|
|[0:02:33]||And that is also very simple. For example, I could acknowledge these 5 unhandled problems all together, since they belong together: they are all CPU load. If I now go to the hammer here in the list and execute a command, it will automatically apply to all items in the list.|
|[0:02:55]||Here you can also see "Do you really want to acknowledge the problems on the following 5 services?" So it concerns these 5 services. If I now confirm here, I have acknowledged all 5 of them and, as you can see, they have been dropped from 'Unhandled Problems'. So now there are no 'Unhandled Problems'. Of course I still have my 6 problems, but these have all been acknowledged.|
|[0:03:18]||Whatever you do, you must always be able to undo it, so in this case I might also want to remove an acknowledgement. This is performed in the same way: I select the items whose acknowledgement I want to remove, for example all of them, and go back and select the hammer. Now you will find this button on the right side, "Remove Acknowledgement". Here you don't need to enter a comment. Simply press 'Remove', click 'Yes', return to the view, and you will see that all of the acknowledgement symbols have now disappeared.|
|[0:03:48]||So now the next question is, of course: Maybe I don't want to edit the whole list here at once, rather only some of them.|
|[0:03:56]||There are basically two possibilities: The first one is to use filters – which you can find up here – to limit the selection, but you can also activate the checkboxes – with this little button here – then every line will get a checkbox, and now I can simply select which services I want to perform the command on, for example these 3, then go back to my hammer, and select 'Acknowledge'.|
|[0:04:19]||I simply enter "Test" here, and you can now see that it only concerns the 3 services which are displayed here. Press 'Yes', and go back to the view. By the way, there are two links here. I can go back to the view with the checkboxes reset, or I can go back so that the checkboxes remain selected because I might want to perform another command on these objects.|
|[0:04:47]||As you might have noticed, there are some interesting options in the panel for the acknowledgement command that I have not yet explained. I want to make up for that now, because there are some very interesting features. So I go back to the hammer, and we will have another look at this panel.|
|[0:05:03]||For this I'll close the sidebar so that we can see the panel better. So, there are three checkboxes here, and there is also time information here. Maybe we should start with this. You can select 'Expire Acknowledgement after', and then specify days, hours and minutes.|
|[0:05:17]||This means that such an acknowledgement now has an expiration time. For example, you could say that the problem should normally be resolved within one day. If you give this acknowledgement a one-day expiration time and the problem still exists after one day, the acknowledgement disappears and the problem reappears as unhandled, because possibly nobody has really taken care of it.|
|[0:05:37]||However, this is a feature that is only available in the Enterprise Edition. Then there is also the checkbox 'sticky' – this is a little more subtle. It could occur, for example, that a problem is in the WARN state, and then later goes to the CRIT state.|
|[0:05:56]||The question now is what happens with an acknowledgement. Take, for example, a file system that is in the WARN state because it is 80 percent full. If you want it to be regarded as unhandled again, as a new problem, when it later goes to CRIT, then this sticky checkbox should not be set.|
|[0:06:17]||If this is set, it would mean that this problem would remain acknowledged until it is OK again. 'Send notification' is enabled by default, and simply means that a notification will be triggered.|
|[0:06:29]||We will be making a separate video on the subject of notification, but with the normal setting this would simply mean that the people responsible would receive an email that this problem has been acknowledged.|
|[0:06:39]||And the 'persistent comment': if I check this option, the comment attached to the service will remain even if the acknowledgement disappears again. Usually the two are connected, so that when the state goes back to OK, for example, the comment disappears along with the acknowledgement.|
|[0:06:56]||But this way you can effectively pin the comment to it, and would have to manually remove the comment later when you don't need it anymore.|
|[0:07:04]||So, that was everything on the subject of acknowledgements. I hope you enjoyed it and that we will meet again for the next episode on the topic of Downtimes.|
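The sticky and expiration behaviour described in the transcript can be sketched in a few lines of Python. This is only an illustrative model, not Checkmk code: the `Service` and `Acknowledgement` classes and all of their fields are hypothetical names chosen for this example, and notifications and persistent comments are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Acknowledgement:
    """Hypothetical record of one acknowledgement (not a Checkmk type)."""
    comment: str
    sticky: bool = False
    expires_at: Optional[datetime] = None  # None means "never expires"


@dataclass
class Service:
    """Hypothetical service with a state and an optional acknowledgement."""
    name: str
    state: str = "OK"  # one of OK, WARN, CRIT, UNKNOWN
    ack: Optional[Acknowledgement] = None

    def acknowledge(self, comment, now, sticky=False, expire_after=None):
        # Only a problem (any state other than OK) can be acknowledged.
        if self.state == "OK":
            raise ValueError("only problems can be acknowledged")
        expires_at = now + expire_after if expire_after is not None else None
        self.ack = Acknowledgement(comment, sticky, expires_at)

    def change_state(self, new_state):
        if new_state == "OK":
            # Back to OK: the acknowledgement always disappears.
            self.ack = None
        elif new_state != self.state and self.ack and not self.ack.sticky:
            # e.g. WARN -> CRIT without 'sticky': treated as a new problem.
            self.ack = None
        self.state = new_state

    def is_unhandled(self, now):
        if self.state == "OK":
            return False
        if self.ack is None:
            return True
        # An expired acknowledgement no longer counts as handling.
        return self.ack.expires_at is not None and now >= self.ack.expires_at


now = datetime(2024, 1, 1, 12, 0)
fs = Service("Filesystem /", state="WARN")
fs.acknowledge("Disk is ordered", now, expire_after=timedelta(days=1))
handled_today = not fs.is_unhandled(now)                      # True
expired_tomorrow = fs.is_unhandled(now + timedelta(days=1))   # True
fs.change_state("CRIT")  # not sticky, so the acknowledgement is dropped
```

In this model, acknowledging with `sticky=True` would keep the acknowledgement across the WARN to CRIT change and only drop it once the state returns to OK.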
Ep. 1: Installing Checkmk & monitoring your first host
In this video Mathias wants to show you how easy it is to get started with Checkmk. Together we will install Checkmk and add some hosts into the monitoring system.
Ep. 2: The Checkmk user interface and sidebar
Mathias would like to show you the Checkmk user interface, and the simplest way for you to use it.
Ep. 3: Monitoring Windows
In today's tutorial we'll learn how to install the Checkmk agent for Windows and how to add a Windows server into the monitoring system.
Ep. 4: Using SNMP to monitor network devices
SNMP is a protocol implemented in many switches, routers, printers, UPSs, hardware sensors and many other devices, which enables them to be easily monitored.
Ep. 5: Using metrics and graphs in Checkmk
Checkmk collects many measurement values that are important for IT monitoring and enables you to use this data for various purposes.
Ep. 6: View creation and customization in Checkmk
How do you edit a view's headline, columns and more in Checkmk? In this video tutorial you'll learn the basics of creating and customizing views.
Ep. 7: Updating Checkmk & using multiple instances
In this video Mathias shows you how to update a Checkmk system to a new version and how to run multiple Checkmk instances on a server at the same time.
Ep. 8 (part I): Working with rules & setting thresholds
In this video, Mathias explains how you can work with rules and set thresholds. First of all we will explain how to set parameters of services.
Ep. 8 (part II): Smart rules with Host Tags
In this video, we'll talk about host tags. These are features you can use to build your rules more intelligently and organize your monitoring more efficiently.
Ep. 8 (part III): Managing Hosts in Folders
In this video, Mathias shows how to manage hosts in folders. This allows you to fine-tune your configuration and save valuable time.
Ep. 9: Working with Hosts and Service Groups
In this video, Mathias explains how to create host and service groups in Checkmk. This will give you focused views of your hosts and services.
Ep. 10: Using the Quicksearch function in Checkmk
In this video Mathias shows you the Quicksearch function of Checkmk. With it you can find and manage your hosts or services easily. He also explains some examples of filters.
Ep. 11: Detecting configuration errors with the Analyze Configuration feature
With the Analyze Configuration module in WATO, Checkmk can check whether it is configured correctly. Checkmk checks for a number of possible security risks or potential performance restrictions and indicates any problems.
Ep. 13: Working with Schedule Downtimes
In this video Mathias explains how you can acknowledge problems in Checkmk. The function helps you to qualify states of hosts and services. This helps you keep an overview of messages in the "Tactical Overview" and provides a few additional functions, like adding comments to problems, for example.