Good IT monitoring stands or falls on its precision. It must inform you at the right time when something is wrong. But, just as in statistics, we have to deal with both false positives and false negatives in monitoring.
A false positive would be the monitoring tool alerting about an issue when in reality the monitored system is perfectly fine. It could be a server being shown as DOWN because there was a short glitch in the network connection, or a short spike in a network device’s used bandwidth triggering a critical alert, when 5 seconds later everything is back to normal again.
False negatives are when your monitoring system does not alert you although something really is wrong. If your firewall is down, you want to know about it. If your monitoring system for some reason does not alert you about this, you can get into trouble really quickly.
And again, as in statistics, you can't eliminate errors completely in your monitoring. The problem, however, is that when running an enterprise IT infrastructure, the costs of a false positive and a false negative can differ vastly. A false negative could be a mission-critical system going down without an alert. A false positive might just be one unnecessary notification that's quickly deleted from your inbox.
Hence, when IT Ops teams weigh the acceptable level of false positives against the acceptable level of false negatives, they will often deem false positives more acceptable. They err on the side of caution and notify, which is totally understandable. The consequence, however, is that these teams get drowned in meaningless alerts, which increases the risk of overlooking a critical one.
In this article, I will present some simple ways in which you can fine-tune the notifications from your monitoring system, so that you won't get drowned in alerts and ideally receive only those that are really relevant. Notifications are only helpful when they produce no false alarms, or at most occasional ones.
I am working with Checkmk, so my tips will make use of features available in Checkmk. Some of them may also be available in your monitoring tool of choice, others may not.
1. Don’t alert. Seriously.
In Checkmk, notifications are actually optional. The monitoring system can still be used efficiently without them. Some large organizations have a sort of control center in which an ops team constantly watches the Checkmk interface, so additional notifications are unnecessary. These are typically users that cannot afford any downtime of their IT at all, a stock exchange for example. They use the problem dashboards in Checkmk to immediately see an issue and its details. As the lists are mostly empty, it is pretty clear when something red pops up on a big dashboard.
But in my opinion, this is rather the exception. Most people use some way of notifying their ops and sysadmin teams, be it through email, SMS or notification plugins for ITSM tools such as ServiceNow, PagerDuty or Splunk OnCall.
A temporary version of this tip is to not alert while you are still setting up your monitoring for the first time. Start monitoring, fix detected issues, then turn on notifications step by step. This way, you avoid generating a flood of alerts that can overwhelm your team. Bring your infrastructure into proper shape first before you set up alerts in your new monitoring tool.
2. Give it time
So if you’ve decided you don’t want to go down the ‘no notifications’ route from my previous point, you need to make sure that your notifications are finely tuned to only notify people in case of real problems.
The first thing to tell your monitoring is: Give it time.
Some systems produce sporadic and short-lived errors. Of course, what you really should do is investigate and eliminate the reason for these sporadic problems, but chasing after all of them as they happen could be futile.
You can reduce alarms from systems like that in two ways: You can simply delay notifications so that they are only sent after a specified time has passed AND the system state hasn't changed back to OK in the meantime. Alternatively, you can use the 'Maximum number of check attempts for service' rule set to tell the monitoring to only notify you if the problem persists over several check intervals (the default Checkmk interval is 1 minute, but you might configure this differently).
The two options are slightly different in how they treat the monitored system, but that would lead us too far down the Checkmk rabbit hole. The essential effect is: By giving a system a bit of time to ‘recover’, you can avoid a bunch of unnecessary notifications.
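To make the effect tangible, here is a minimal Python sketch (not Checkmk code; the service name and attempt limit are made up) that simulates the 'maximum number of check attempts' idea: a notification only fires once a problem has persisted over several consecutive check intervals.

```python
# Illustrative only: short-lived problems never reach the attempt limit,
# so they never trigger a notification.
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    max_check_attempts: int = 3   # notify only after 3 consecutive bad checks
    failed_attempts: int = 0

    def record_check(self, state_ok: bool) -> None:
        if state_ok:
            # Problem cleared before reaching the limit: no alert was sent.
            self.failed_attempts = 0
            return
        self.failed_attempts += 1
        if self.failed_attempts == self.max_check_attempts:
            print(f"NOTIFY: {self.name} has been failing for "
                  f"{self.failed_attempts} consecutive check intervals")

svc = Service("network-latency")
# A short glitch (two bad checks, then recovery) stays silent;
# a persistent problem notifies after the third consecutive failure.
for ok in [False, False, True, False, False, False]:
    svc.record_check(state_ok=ok)
```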
This method also works great for ‘self-healing’ systems that should recover on their own; you wouldn’t want to get a notification for this.
Of course, this is not an option for systems that are so mission-critical that you can't afford ANY downtime and need to react immediately. For example, a hedge fund that monitors the network link to a derivatives marketplace cannot trade if that link goes down. Every second of downtime could cost them dearly.
3. On average, we don’t have a problem
Notifications are often triggered by threshold values on utilization metrics (e.g. CPU utilization) that exceed the threshold only for a short time. As a general rule, such brief peaks are not a problem and should not immediately cause the monitoring system to start notifying people.
For this reason, many check plug-ins offer a configuration option to average their metrics over a longer period (say, 15 minutes) before the alerting thresholds are applied. With this option, temporary peaks are ignored: the metric is first averaged over the defined time period, and only then are the threshold values applied to that average.
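Conceptually, the averaging works like this small Python sketch (illustrative only; the thresholds, window size, and sample values are invented for the example, not Checkmk defaults):

```python
# Illustrative only: apply warning/critical thresholds to a rolling average
# instead of to each raw sample, so a short CPU spike does not trigger an alert.
from collections import deque

WARN, CRIT = 80.0, 90.0   # example thresholds in percent
WINDOW = 15               # average over the last 15 samples (e.g. 15 minutes)

samples: deque = deque(maxlen=WINDOW)

def evaluate(cpu_percent: float) -> str:
    samples.append(cpu_percent)
    avg = sum(samples) / len(samples)
    if avg >= CRIT:
        return "CRIT"
    if avg >= WARN:
        return "WARN"
    return "OK"

# A single 100% spike among otherwise moderate values stays OK,
# because only the averaged value is compared against the thresholds.
for value in [40, 45, 100, 50, 42]:
    print(f"sample={value:>3}  state={evaluate(value)}")
```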
4. Like parents, like children
Imagine the following scenario: You are monitoring a remote data center. You have hundreds of servers in that data center working well and being monitored by your monitoring system. However, the connection to those servers goes through the DC’s core switch. Now that core switch goes down, and all hell breaks loose. All of a sudden, hundreds of hosts are no longer being reached by your monitoring system and are being shown as DOWN. Hundreds of DOWN hosts mean a wave of hundreds of notifications…
But in reality, all those servers are doing just fine. It’s just that one core switch which is acting up. So what do you do about it?
You need to tell your monitoring system that all these servers are dependent on that core switch. In Checkmk you can do so by using parent-child relationships. By declaring host A as the 'child' of another 'parent' host B, you tell Checkmk that A depends on host B. Checkmk then pauses notifications for the children while the parent is down and marks them as UNREACH.
Note: A child host can have more than one parent host. A child host only becomes unreachable when all of its parent hosts are down.
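The decision logic boils down to something like the following Python sketch (illustrative only, with invented host names; this is not how Checkmk is implemented internally):

```python
# Illustrative only: when a host stops responding, check its parents first.
# If *all* parents are DOWN, flag the host as UNREACH instead of DOWN and
# suppress its notification.
down_hosts = {"core-switch-1"}   # hosts confirmed DOWN

parents = {
    "server-001": ["core-switch-1"],                   # single parent
    "server-002": ["core-switch-1", "core-switch-2"],  # redundant uplinks
}

def classify(host: str, responding: bool) -> str:
    if responding:
        return "UP"
    host_parents = parents.get(host, [])
    if host_parents and all(p in down_hosts for p in host_parents):
        return "UNREACH"   # suppressed: the real problem is the parent
    return "DOWN"          # genuinely down, notify as usual

print(classify("server-001", responding=False))  # UNREACH -> no notification
print(classify("server-002", responding=False))  # DOWN    -> still notified
```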
5. Avoid alerts on systems that are supposed to be down
There are hundreds of reasons why a system should be down at times. Maybe some systems need to be rebooted regularly, maybe you are doing maintenance, or you simply don't need a system at certain times. What you don't want is your monitoring system going into panic mode during these times, alerting who-knows-whom about a system that is supposed to be down.
To avoid this, you can use scheduled downtimes. During a scheduled downtime, Checkmk will not alert for that system, although it will continue to monitor it. It's as simple as that. It's also possible to schedule recurring downtimes if you're using a commercial Checkmk edition. So, if you know that a certain system is rebooted every night between midnight and 2 in the morning, you can set a recurring downtime for that time window and never be wrongly notified when the host is down for a short while during that time.
Scheduled downtimes work for entire hosts, but also for individual services. But why would you send certain services into scheduled downtimes? More or less for the same reason as hosts – when you know something will be going on that would trigger an unnecessary notification. You still might want your monitoring to keep an eye on the host as a whole, but you are expecting and accepting that some services might go haywire and breach thresholds for some time. An example could be a nightly cron job that syncs data somewhere, causing a service such as Disk I/O to spike. If everything goes back to normal once the sync is through, there is no need to lose sleep over it.
Moreover, you can extend scheduled downtimes to ‘Children’ of a ‘Parent’ host as well.
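As a rough illustration of the effect (not Checkmk's implementation; the host name and time window are made up), the following Python sketch suppresses notifications while a host is inside its recurring downtime window, like the nightly reboot between midnight and 2 a.m. mentioned above:

```python
# Illustrative only: the host keeps being monitored, but no notification
# is generated while the current time falls inside its downtime window.
from datetime import datetime, time

DOWNTIME_WINDOWS = {
    "backup-server-01": (time(0, 0), time(2, 0)),  # nightly reboot window
}

def in_scheduled_downtime(host: str, now: datetime) -> bool:
    window = DOWNTIME_WINDOWS.get(host)
    if window is None:
        return False
    start, end = window
    return start <= now.time() < end

def notify_if_needed(host: str, state: str, now: datetime) -> None:
    if state != "UP" and not in_scheduled_downtime(host, now):
        print(f"NOTIFY: {host} is {state}")

notify_if_needed("backup-server-01", "DOWN", datetime(2024, 1, 1, 1, 30))  # silent
notify_if_needed("backup-server-01", "DOWN", datetime(2024, 1, 1, 9, 0))   # notifies
```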
Wrapping up
I hope this short overview has given you some ideas about really simple ways to cut down on the number of meaningless notifications your ops team is getting from your monitoring system. Many of the features I described are not exclusive to Checkmk, but are also available in other monitoring tools. There are many more ways, but this should get you started quite nicely.
Fine-tuning your alerts is one of the most important, but also most rewarding activities when it comes to configuring your monitoring system. The impact of a well-defined notification setup will be felt immediately.
If you want to learn more about how to manage your notifications in Checkmk, check out this docs article or post a question in the forum.