Ep. 27: Detecting issues and configuring alerts for Kubernetes clusters
[0:00:00] | In this video, we will show you how to detect issues in your Kubernetes clusters and configure alerts. |
[0:00:14] | Sooner or later, every Kubernetes cluster runs into problems: there are performance problems, there are capacity constraints. |
[0:00:25] | There are workloads which are simply not working as expected. And for that, you need monitoring. |
[0:00:30] | In this video, I will first show you what the new Kubernetes monitoring looks like and how you can discover issues there. We will also resolve one issue that we find. |
[0:00:40] | And then I will show you how to configure your own alerts easily. Let's now have a look. |
[0:00:48] | Okay, let's start by looking at our Kubernetes Cluster dashboard. |
[0:00:53] | This is your entry point into your Kubernetes monitoring and it provides you an overview of the health of your cluster. |
[0:01:00] | And before we go into the actual alerting, let's try to understand how healthy my cluster actually is. |
[0:01:05] | On the right part here, I can see the CPU resources, the Memory resources, and the Pod resources of my cluster. |
[0:01:14] | I can see how much CPU is actually being used on the cluster in total, what the requests are, what the limits are, and how much is actually allocatable. |
[0:01:27] | You can see that over time here, and you can also see the overall utilization; the same goes for memory utilization. And I can also see here how many pods are actually running. I can see that 3 are pending and 1 is free. |
[0:01:44] | And there is a total of 51 which I can allocate in this cluster. That leads me to the first alert which I see here, this big red thing, which tells me: hey, there's only one free pod in my cluster, which is not too great. |
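For reference, the same capacity numbers can be pulled straight from the Kubernetes API with kubectl; a quick sketch, with <node-name> as a placeholder:

    # Allocatable pod slots per node
    kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods

    # Pods currently scheduled on a given node (compare against the slots above)
    kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>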
[0:02:03] | Below these performance metrics, I can actually see here what are the problems in my cluster. |
[0:02:09] | There's a table of all the issues which your cluster has. You can see: hey, there's only 1 free pod on the entire cluster, and a couple of nodes actually have no free pods at all. |
[0:02:21] | There is a deployment here which seems to have a couple of problems, and so on. On the left part, at the top, I can first see some status information: do I actually get data, is the API healthy? |
[0:02:37] | I can see the health of my nodes. It doesn't look too great there either, and below I can see DaemonSets, StatefulSets, and Deployments. |
[0:02:44] | So, you can see the workloads in my cluster, and I can see how much CPU and memory is being consumed. |
[0:02:52] | For each of the workloads, I can see which one is consuming a lot. Here, this deployment is consuming the most in my cluster, and it also has a problem. |
[0:03:02] | We can take a look at that by clicking on the name of the deployment, which leads us to the dashboard for the deployment and makes it very easy to understand what's happening here. |
[0:03:14] | So, we again have some performance metrics: the CPU usage of that specific deployment overall, that is, the sum of all the pods in it. |
[0:03:25] | We can also see the individual pods here to the very right. You can see there was a dip a couple of moments ago. |
[0:03:32] | We can see the same for the memory usage, where we don't see a dip. And below we can see: 5 pods are running, 1 is pending. |
[0:03:44] | This is also being displayed, so we can now understand why it actually showed CRIT before. We see there are 5 ready replicas out of the 6 desired, and this is also being shown here in the deployment problems. |
[0:03:57] | So, these are a couple of examples of the off-the-shelf alerts which come with Checkmk. But we don't just want to see alerts, we also want to see how you can actually fix them. So, let's try to understand what the problem actually is here. |
[0:04:11] | For that, I'm going to take a look at the actual pod. I can either go through the pod overview here, see that this pod is red, and use that to navigate to the pod, or I use the deployment problems, which say: hey, this pod is pending. |
[0:04:28] | Let me take a look at it. In here I get to the classic Checkmk service overview for this specific object, for this host. And you can see the status again: it has been pending for 1 hour and 49 minutes. |
[0:04:41] | I can see that something isn't working, and I can see here in the condition why it's actually not working. |
[0:04:48] | You can see it's not scheduled, and it can't be scheduled because no nodes are available: there are too many pods and insufficient CPU. |
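If you prefer the command line, the scheduler records the same diagnosis as an event on the pod; a sketch, with <pod-name> as a placeholder (the exact message varies by cluster and version):

    # The Events section at the bottom explains why scheduling failed
    kubectl describe pod <pod-name>
    # Typical event:
    #   Warning  FailedScheduling  ...  0/3 nodes are available:
    #   1 Insufficient cpu, 2 Too many pods.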
[0:04:58] | So, what is the solution to that? You could kill some pods, but that will not solve your problem in the long term. |
[0:05:04] | I think the actual solution is to add another node to your cluster, because if you don't have enough CPU, the easiest thing is just to add a node. And that's what we will do now. |
[0:05:18] | I'm going to switch into my AWS console, the user interface for my cluster, and I'm going to edit the node group here to add another node. |
[0:05:33] | I'm going to save the changes and take a look at the update history here: started a few seconds ago, currently in progress. It will take some time, and in a couple of minutes the problem is hopefully gone and our cluster is a little bit healthier. |
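If you drive your cluster from the command line instead of the AWS console, the equivalent step for an EKS managed node group could look roughly like this; the cluster and node group names are made up for illustration:

    # Grow the managed node group from 3 to 4 nodes
    eksctl scale nodegroup --cluster=my-cluster --name=my-nodegroup --nodes=4
    # If 4 exceeds the group's configured maximum, raise it too: --nodes-max=4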
[0:05:48] | Okay, so we are back. We can see the node has been rolled out here, at least AWS thinks so. Let's take a look at the actual state in our monitoring. |
[0:06:02] | Let's go back to our deployment, and we can already see that it seems to have worked. We now have 6 pods running here. |
[0:06:13] | So, 6 desired, and instead of 5 we are now at 6. You can see there's actually nothing pending anymore, which is exactly what we wanted. There are no problems here anymore. |
[0:06:26] | Let's go even a bit further back, into our cluster. You can see we now have 4 nodes in our cluster and 13 free pods again. |
[0:06:38] | So we can actually run some more workloads on our cluster. Adding the node helped us resolve a couple of problems in our cluster. We can see here: the deployment has no problems anymore, it's good. |
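You can double-check the same state with kubectl:

    # The new node should be listed and Ready
    kubectl get nodes

    # The READY column of our deployment should now read 6/6
    kubectl get deployments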
[0:06:55] | But you can still see that a couple of our nodes are having issues. Some of our nodes don't look too healthy. Let's take a look at them to get a view of the node monitoring which comes with Checkmk 2.1. |
[0:07:11] | Okay, here is the view of the services of the host. And as you can see, the Pod resources service is still critical here, because there are still nodes with 0 free pods. |
[0:07:24] | What has happened is that new workloads get scheduled onto the new node, but Kubernetes doesn't move the existing workloads from this node to other nodes. So, this is actually quite unbalanced at the moment. |
[0:07:37] | This is something which, if you want, you can fix. You don't have to, but it tells you a little bit about the balance inside your cluster. |
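If you do want to even things out, one pragmatic option is to let the scheduler re-place the pods; a hedged sketch, with <deployment-name> as a placeholder:

    # Recreate the pods one by one; the scheduler will spread them across
    # all nodes, including the newly added one
    kubectl rollout restart deployment <deployment-name>

For continuous, automatic rebalancing, the separate Kubernetes descheduler project is the usual tool; that is outside the scope of this video.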
[0:07:47] | Besides the critical service which we see, there are a lot of other services which tell us a lot about the health of a node. |
[0:07:55] | Let's take a look at the individual services which we see here for that node. |
[0:07:59] | The first thing we see is a condition, and this condition comes from Kubernetes, from the Kubernetes API. Kubernetes thinks this node is healthy because it passes all the conditions. We can also see how many containers are running on this node. |
[0:08:15] | We can see the CPU load, and we can also see something like CPU resources and utilization. You might wonder why there are actually 3 services concerned with the CPU. |
[0:08:26] | The simple reason is that there's the Kubernetes view, which is the CPU resources, and there's the actual node view. |
[0:08:36] | Kubernetes doesn't know anything about what is running outside of Kubernetes on the node, and best practice says there shouldn't be anything else running there. |
[0:08:46] | But there could be things running on the node which consume CPU, and it's good to know about them, because if there's a performance problem on that node, you want to know about it. |
[0:08:58] | That's why we provide holistic monitoring of your Kubernetes nodes, so that you can discover problems which reside outside of Kubernetes as well. |
[0:09:06] | For example, the file system: is your file system being filled up by some application outside the Kubernetes cluster? I think that's important to know, because in this case Kubernetes will not warn you about it, but Checkmk can. |
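On the node itself, the manual equivalent would be the standard tools; the Checkmk agent collects this data for you:

    # Filesystem usage on the node, including everything Kubernetes never sees
    df -h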
[0:09:21] | You can also see some further information: what kind of node it actually is, what OS is running on it, what container runtime, some context information. |
[0:09:30] | You can see further information on the Kernel Performance. You can see if the kubelet is actually running, which is very important: if the kubelet is not running, your node will not work at all. |
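For reference, the manual checks behind that kubelet service would be something like this on a systemd-based node:

    # On the node: is the kubelet service up?
    systemctl status kubelet

    # From outside: a node whose kubelet is down shows up as NotReady
    kubectl get nodes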
[0:09:42] | You can see memory here as well; it's very similar to CPU: it doesn't only matter what is running inside your Kubernetes cluster, but also outside. And you can see further information around the node. |
[0:09:58] | As you can see, we have holistic monitoring for your nodes. That is very important, because all the workloads in your Kubernetes cluster run on nodes, and you need to be able to alert on them. |
[0:10:11] | Now let's actually go into alerting, because we have seen a couple of off-the-shelf alerts. We focused on delivering only the alerts that you really need. |
[0:10:22] | But you might have your own use case; you might want to alert on specific things. And we can do that by configuring our own alert. Let's do that. |
[0:10:31] | To configure an alert from a service, we can just go to the burger menu here and click on 'Parameters for this service'. We want to create an alert for CPU resources. |
[0:10:44] | Now we see there's a rule which is applicable. We just click here and get directly to the rule which is relevant for us. |
[0:10:53] | I create a generic rule and say, for example: hey, I want to be alerted on limit utilization. |
[0:11:02] | Because it could be the case that I don't want my pods to run into CPU throttling, which is what happens when your limit utilization reaches 100%. |
[0:11:15] | Kubernetes will just start throttling your pods, and this will lead to performance problems for your application, so you might want to have an alert for that. And you want to be alerted a little bit before it actually happens, so we can just use these thresholds. |
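The limit that this utilization is measured against lives in the pod spec. A sketch of how to inspect or change it with kubectl; <deployment-name> and the 500m value are placeholders:

    # Show the CPU limit configured for a deployment's containers
    kubectl get deployment <deployment-name> \
      -o jsonpath='{.spec.template.spec.containers[*].resources.limits.cpu}'

    # Set or change the limit; utilization is usage divided by this value,
    # and throttling starts once usage hits the limit
    kubectl set resources deployment <deployment-name> --limits=cpu=500m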
[0:11:30] | And now let's define for which objects this alert should be valid. For that, we can use labels. Labels are super powerful, and Checkmk 2.1 comes with a lot of predefined labels for Kubernetes. One of them identifies the kind of object. |
[0:11:51] | So, for example, I want to have this rule applicable only to pods. And I don't want it to apply to all deployments; I have a specific deployment which is really important to me. |
[0:12:11] | And the deployment is called, let's take a look, I know what it's called. It's php-apache. |
[0:12:16] | This is the deployment. I only want this alert to be valid for pods of this deployment. That's it. That's how easily I can configure an alert, for example, for CPU limit utilization. |
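Under the hood, these conditions are host labels that the Checkmk Kubernetes agent attaches automatically. The exact label keys depend on your Checkmk version, so take the following pair as an illustration of the pattern for matching the pods of the php-apache deployment:

    cmk/kubernetes/object:pod
    cmk/kubernetes/deployment:php-apache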
[0:12:32] | Let's save that and activate the changes. And to actually see this alert in action, let's first go back into our dashboard. |
[0:12:49] | Let's take a look at the deployment dashboard for that one. Right now we are pretty far away from the actual limit, as we can see here. |
[0:13:06] | It's pretty far off here; we don't get an alert yet. So, what I will do is increase the workload on this deployment. |
[0:13:14] | Okay, to actually increase the workload on this deployment, I have a nice little demo application here. |
[0:13:28] | Let's scale this one up. It's called infinite-calls. What I will do now is increase the workload by adding some more pods to it. Let's try 35. |
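The kubectl equivalent of that scaling step:

    # Scale the demo deployment to 35 replicas to generate load
    kubectl scale deployment infinite-calls --replicas=35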
[0:13:49] | And let's wait a little bit until all of them are spawned. And then we can hopefully see an alert in our Checkmk monitoring. |
[0:14:00] | So, we are back in our deployment dashboard, and we can already see the first alert here. You can see that 1 pod has a usage of 0.4. |
[0:14:11] | And if we take a look at this, we can see the limit utilization is at 85%. We have a limit of 0.5. |
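The numbers line up: 85% of the 0.5 CPU limit is 0.425 cores, so the usage of 0.4 shown above is presumably just rounded on the dashboard.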
[0:14:19] | We can actually take a detailed look at the metrics here. As you can see, the limit is the line up here, and the green area is the usage. We also have a graph for the limit utilization, and that's where our new alert now applies. |
[0:14:41] | You can see immediately where our thresholds for the alerts are: we have our warning at 80% and our critical at 90%. |
[0:14:50] | And you can see we have already reached the 80%, so we know there is quite a lot of workload on that individual pod. We are not at the maximum yet, not at 100%, but we are not too far away from it. |
[0:15:05] | And for that, we will now see a clean alert in our emails, for example, or in our Slack channels, however you have configured that. And that's how easy it is to set up alerts with the new Kubernetes monitoring of Checkmk 2.1. |
[0:15:18] | I hope you liked it, and I hope you have a lot of fun monitoring Kubernetes yourself. Thanks for watching. Please like and subscribe. |
Want to know more about Checkmk? Join us for our Introduction to Checkmk Webinar