In statistics, we must always deal with two types of errors, typically referred to as ‘Type I’ and ‘Type II’. A Type I error is, in statisticians’ terms, ‘the mistaken rejection of the null hypothesis’. Or, in layman’s terms – a false positive. A false positive would be a COVID-19 test returning a positive result for a healthy person, a legitimate email being flagged as spam, or an innocent person being found guilty of a crime.
A Type II error is – hardly surprisingly – the exact opposite of a Type I error: the ‘mistaken acceptance of the null hypothesis’, also known as a ‘false negative’. In this case, the guilty person goes free, or the spam mail from that Nigerian prince actually does make it into your inbox.
In monitoring, we also have to deal with both false positives and false negatives. A false positive is when the monitoring tool alerts about an issue even though the system being monitored is perfectly fine. Examples of this are manifold: a server being shown as DOWN because of a short glitch in the network connection, or a brief spike in a network device’s bandwidth usage triggering a critical alert, even though five seconds later everything is back to normal again.
False negatives are when your monitoring system does not alert you when in fact there really is a problem. If your firewall is down, you want to know about it. If your monitoring system for some reason does not alert you of this, you can get into real trouble, really quickly.
Where do the acceptable error levels lie?
The tricky part in statistics is that, for a given test and amount of data, you can’t eliminate both Type I and Type II errors completely. It’s mathematically impossible. The only thing you can do is settle on an acceptable level for each. Pushing too hard towards eliminating Type I errors will increase your Type II errors, and vice versa. Where those acceptable levels actually lie depends on the particular situation.
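To make that trade-off concrete, here is a minimal sketch with purely hypothetical numbers (not tied to any particular monitoring tool): it simulates latency measurements from healthy and genuinely degraded systems and counts the error rates at a few alert thresholds.

```python
import random

random.seed(42)

# Hypothetical latency samples: healthy systems hover around 100 ms,
# genuinely degraded systems around 160 ms (illustrative numbers only).
healthy = [random.gauss(100, 20) for _ in range(10_000)]
degraded = [random.gauss(160, 20) for _ in range(10_000)]

for threshold in (120, 140, 160):
    # False positive: a healthy sample exceeds the alert threshold.
    fp_rate = sum(x > threshold for x in healthy) / len(healthy)
    # False negative: a degraded sample stays below the threshold.
    fn_rate = sum(x <= threshold for x in degraded) / len(degraded)
    print(f"threshold {threshold} ms: {fp_rate:.1%} false positives, {fn_rate:.1%} false negatives")
```

Raising the threshold silences more of the false positives, but it also lets more real problems slip through unnoticed, which is exactly the trade-off described above.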
The same is true in monitoring. The problem is that when running an enterprise IT infrastructure, the costs of a false positive or a false negative can be vastly different. A false negative could mean that a mission-critical system is down and not alerting. A false positive might just be one unnecessary alert that can be quickly deleted from your inbox.
So, when IT Ops teams weigh an acceptable level of false positives against an acceptable level of false negatives, they often deem a flood of additional false positives far more acceptable than the risk of even a few more false negatives. This is why many IT Ops teams err on the side of caution and answer our title question – to notify or not to notify? – with: notify. This is totally understandable.
More than 300 alerts every day
The consequence, however, is that these teams get drowned in meaningless alerts. One tribe29 customer migrated to Checkmk from a system that sent roughly 10,000 alerts per month to the Ops team. That’s more than 300 alerts every day – or roughly one every four minutes, assuming 24/7 operations. You could go for lunch and when you come back, the entire first page of your inbox is filled with new alerts.
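For reference, the arithmetic behind those figures, assuming a 30-day month and round-the-clock operations:

```python
alerts_per_month = 10_000
days_per_month = 30  # assumed for a back-of-the-envelope figure

alerts_per_day = alerts_per_month / days_per_month   # ~333 alerts per day
minutes_between_alerts = 24 * 60 / alerts_per_day    # ~4.3 minutes between alerts

print(f"~{alerts_per_day:.0f} alerts per day, one every {minutes_between_alerts:.1f} minutes")
```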
Unfortunately, most of these alerts were false positives. The monitoring tool's notification system was relatively inflexible and had probably also not been set up in an optimal way. The result was a team being barraged with alerts, most of which they knew were probably meaningless.
Still, because there’s always that nagging risk of missing something important, someone needs to go through all of these alerts and check whether each one is an actual problem or just a false positive. As you can imagine, this takes a lot of time, and it’s not the most fun activity in the world either.
The staggering cost of false positives
The cost of this can be staggering. If we assume it takes just two minutes to verify whether an alert is legitimate or not, the Ops team in the customer case mentioned above spent about 20,000 minutes every month just verifying alerts (10,000 alerts times two minutes each). That’s more than two full-time roles dedicated to that task alone, and it does not include actually fixing any real problems (or investigating the root cause of a false alert and trying to mitigate it).
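As a rough sanity check on that estimate, here is the same back-of-the-envelope calculation, assuming about 160 working hours per month for one full-time role (our assumption, not a figure from the customer):

```python
alerts_per_month = 10_000
minutes_per_alert = 2        # assumed verification effort per alert
hours_per_fte_month = 160    # assumed working hours of one full-time role per month

verification_minutes = alerts_per_month * minutes_per_alert  # 20,000 minutes
verification_hours = verification_minutes / 60               # ~333 hours
full_time_roles = verification_hours / hours_per_fte_month   # ~2.1 roles

print(f"{verification_hours:.0f} hours per month, roughly {full_time_roles:.1f} full-time roles")
```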
But the cost is often not only internal. Another customer of ours, a European railway operator, uses Checkmk to monitor their internal communications network. This is the system through which the railway operators, the stations, and the traffic controllers communicate. This system is absolutely mission-critical for the customer. Every alert needs to be investigated quickly.
Unfortunately, the previous monitoring system produced a lot of false alerts. The company used an external field service provider to fix problems in the field. But of course, this service provider also invoiced them for investigating the false alerts. Once the monitoring and alerting were replaced by Checkmk, the total cost of this dropped by more than 65,000 euros every year because there were no more false alerts! Spoiler alert: Checkmk costs the customer significantly less than those 65,000 euros per year.
In a time of tight budgets and a desperate shortage of skilled technical staff, no company can afford to spend that much time or money on activities that add so little value. This is one reason why we focus so much on getting notifications right when we help customers implement Checkmk.
The other reason, the hidden cost of false alerts, we will explore in our next article. In the third part of this series, we will take a look at the various tools you can use to create better alerting and notifications.