As explained in the previous article, most IT operations teams err on the side of caution when it comes to alerting. In other words, they optimize to keep Type II errors (missed problems) low, and thereby increase the number of Type I errors (false alerts) they incur.
But that caution comes at a cost. We’ve already briefly explored the direct cost of needing to sift through tens of thousands of alerts and discarding the large majority of them as meaningless. We are talking about hundreds of hours of work by highly trained – and therefore expensive – specialists.
Let’s revisit the example from one of our customers: Their IT Ops team received between 8,000 and 10,000 alerts every month from their legacy monitoring system. Most of those were false alerts.
When we talked to the people on that team, the sheer load was only one of their problems. The words they used to describe the process of verifying alerts were ‘mind-numbing’, ‘messed up’, ‘time eaters’, or ‘totally senseless’.
Teams get drained from alerting process
You get the point. People were suffering from acute alert fatigue. Alert fatigue describes a phenomenon where workers become desensitized to safety alerts and, as a result, ignore or fail to respond appropriately to such warnings. The effect is well known in healthcare, but also in construction, mining, and, worryingly enough, nuclear power plants. Alert fatigue also exists in IT Ops, Network Operations, and SOCs around the world.
The impact of alert fatigue on IT Ops teams is severe: If most of your time is spent verifying (probably meaningless) alerts, then less time is spent doing interesting things that you would actually like to be doing. Over time, this situation can wear down a team’s morale. People will start to look for other challenges and even resign, leaving the company shorthanded and having to spend lots of time and money to find and train a replacement.
On top of that, there’s a high opportunity cost: When the Ops team is overwhelmed and drained by the alerting process, they are unable to innovate and improve the IT infrastructure and platforms. Because they’re only responding to (again: probably meaningless) alerts, they’re not able to explore better systems, automate infrastructure, or actively eliminate root causes to prevent future problems. Over time, this adds to technical debt, as problems are never properly addressed and lasting fixes are never implemented.
The alternative is even worse: simply turning off or ignoring alerts. Everyone knows you shouldn’t do that. Nevertheless, it happens all the time.
False alarms in the night
Just imagine you’re the sysadmin on call and you get alerted in the middle of the night. And when you wake up, you already know that there’s an 80% chance that it’s a false alarm. What would you do?
Well, the reaction is only natural. According to research by IT security firm Critical Start, almost 40% of interviewed operations professionals admitted to ignoring certain categories of alerts. So much for eliminating false negatives, right?
But surely a rate of 80% false alerts seems a bit high, doesn’t it? Don’t be too sure. According to the same study, almost half of all respondents reported a false-positive rate in excess of 50%! No wonder people start to tune out. Some of the stories our customers tell confirm these very high rates of false alerts.
So, what’s the solution? Let’s look at that customer from our first article again.
When the customer introduced Checkmk with the help of our partner SVA, they took considerable care to improve their alerting procedures. Using the various tools Checkmk put at their disposal, they brought the number of monthly alerts down from 8,000 to 10,000 to approximately 2,000. This is a reduction of 75-80% without any adverse effect on the quality of service! Every month, that alone saves the customer hundreds of hours of sifting through meaningless alerts, and it easily justifies their investment in the Checkmk Enterprise Edition subscription.
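To make the arithmetic behind these figures explicit, here is a minimal sketch in Python. The alert counts are the ones quoted above. The time spent verifying a single alert is not stated in this article, so the 3 minutes used here is purely a hypothetical assumption for illustration, as are the function names.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage drop in monthly alert volume from `before` to `after`."""
    return 100 * (before - after) / before


def hours_saved(before: int, after: int, minutes_per_alert: float) -> float:
    """Monthly hours no longer spent verifying the alerts that disappeared."""
    return (before - after) * minutes_per_alert / 60


ALERTS_AFTER = 2_000      # approximate monthly alerts after the change (from the article)
MINUTES_PER_ALERT = 3     # ASSUMPTION: hypothetical average verification time per alert

# Lower and upper bound of the monthly alert volume before the change (from the article)
for alerts_before in (8_000, 10_000):
    pct = reduction_percent(alerts_before, ALERTS_AFTER)
    hours = hours_saved(alerts_before, ALERTS_AFTER, MINUTES_PER_ALERT)
    print(f"{alerts_before} -> {ALERTS_AFTER}: {pct:.0f}% fewer alerts, "
          f"about {hours:.0f} hours saved per month")
```

Under that assumed 3 minutes per alert, the drop works out to roughly 300 to 400 hours per month. A different verification time scales the result proportionally, so plug in your own team's number.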
In the third part of this series, we will take a look at the various tools our partners and customers – and you – can use to create better alerting and notifications.