Ep. 27: Detecting issues and configuring alerts for Kubernetes clusters
|[0:00:00]||In this video, we will show you how to detect issues in your Kubernetes clusters and configure alerts.|
|[0:00:14]||Sooner or later, every Kubernetes cluster has problems: there are performance problems, there are capacity constraints.|
|[0:00:25]||There are just workloads which are not working as expected. And for that, you need monitoring.|
|[0:00:30]||In this video, I will first show you what the new Kubernetes monitoring looks like and how you can discover issues there. We will also resolve one issue which we find.|
|[0:00:40]||And then I will show you how to configure your own alerts easily. Let's now have a look.|
|[0:00:48]||Okay, let's start by looking at our Kubernetes Cluster dashboard.|
|[0:00:53]||This is your entry point into your Kubernetes monitoring and it provides you an overview of the health of your cluster.|
|[0:01:00]||And before we go into the actual alerting, let's try to understand how healthy my cluster actually is.|
|[0:01:05]||On the right part here, I can see the CPU resources, the Memory resources, and the Pod resources of my cluster.|
|[0:01:14]||I can see how much CPU is actually used across the cluster in total, what the requests are, what the limits are, and how much is actually allocatable.|
|[0:01:27]||Yeah, you can see that over time here. You can also see the overall utilization, and the same for memory utilization. And I can also see here, okay, how many pods are actually running. I can see that 3 are pending, 1 is free.|
|[0:01:44]||And I have a total of 51 which I can allocate in this cluster, which leads me to the first alert which I see here, which is this big red thing here, which tells me, hey, there's only one free pod in my cluster, which is not too great.|
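The pod numbers on the dashboard can be related with simple arithmetic. The following is a minimal sketch, not Checkmk's implementation: it assumes that free pods are the allocatable pods minus the running and pending ones. The function name `free_pods` is hypothetical, and the running count of 47 is not stated in the video; it is inferred here so that the result matches the 1 free pod shown.

```python
def free_pods(allocatable: int, running: int, pending: int) -> int:
    """Pod slots still available.

    Assumption: free = allocatable - running - pending.
    """
    return allocatable - running - pending


# 51 allocatable and 3 pending come from the video.
# 47 running is an inferred value, not a number shown in the video.
print(free_pods(allocatable=51, running=47, pending=3))
```

Under this assumption, every additional pending or running pod reduces the free count by one, which is why a single free pod is flagged as a capacity problem.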
|[0:02:03]||Below these performance metrics, I can actually see here what are the problems in my cluster.|
|[0:02:09]||There's a table of all the issues which your cluster has. You can see, hey, there's only 1 free pod on the entire cluster, a couple of nodes have actually no pods free at all.|
|[0:02:21]||I have here a deployment which seems to have a couple of problems and so on. On the left part, on top I can actually first see, okay, some status information, do I actually get data, is the API healthy.|
|[0:02:37]||I can see the health of my nodes. It doesn't look too great there as well and I can see below DaemonSets, StatefulSets, and Deployments.|
|[0:02:44]||So, you can see workloads in my cluster. I can see how much is being consumed for CPU and memory.|
|[0:02:52]||For each of the workloads, I can see who is consuming a lot. So, here this deployment is consuming the most in my cluster and it also has a problem, yeah.|
|[0:03:02]||We can take a look at that by clicking here on the name of the deployment, which will lead us to the dashboard for the deployment, which makes it very easy to understand what's happening here.|
|[0:03:14]||So, we have again some performance metrics, what is the CPU usage of that specific deployment, and overall so the sum of all the pods in it.|
|[0:03:25]||We can see also the individual pods here to the very right. You can see, okay hey, there was a dip a couple of moments ago.|
|[0:03:32]||We can see the same for the memory usage where we don't see a dip. And then below we can see, hey okay, there are 5 pods running, 1 is pending.|
|[0:03:44]||This is also being displayed so we can now understand why it actually showed CRIT before. And we see, hey, there are 5 ready replicas and we actually want 6. And this is also being shown here in the deployment problems.|
|[0:03:57]||So, these are a couple of examples of the off-the-shelf alerts which come with Checkmk, and we don't just want to see alerts, we also want to see how you can actually fix them. So, let's try to understand what is actually the problem here.|
|[0:04:11]||For that, I'm going to take a look at the actual pod. I can either go through the pod overview here and see, okay hey, this pod is red, I can use that to navigate to the pod, or I use the deployment problem, say hey, this pod is pending.|
|[0:04:28]||Let me take a look at it and in here I get to the classical service overview of Checkmk for this specific object, for this host. And you can see again, okay, here's the status. It's pending since 1 hour and 49 minutes.|
|[0:04:41]||I can see, okay, something isn't working and I can see here in the condition why it's actually not working.|
|[0:04:48]||You can see it's not scheduled and it can't be scheduled because no nodes are available. There are too many pods and insufficient CPU.|
|[0:04:58]||So, what is the solution to that? I mean, you could kill some pods, but this will not solve your problem in the long term.|
|[0:05:04]||I think the actual solution is add another node to your cluster because if you don't have enough CPU, I mean the easiest thing is just to add a node, and that's what we will do now.|
|[0:05:18]||I'm going to switch into my AWS console or user interface for my cluster. And I'm gonna edit the node group here to just add another node.|
|[0:05:33]||I'm gonna save the changes, take a look at the update history here. A few seconds ago, it's in progress. It will take some time. And in a couple of minutes hopefully the problem is gone and our cluster is a little bit healthier.|
|[0:05:48]||Okay, so we are back. We can see the node has been rolled out here, at least, AWS thinks that. Let's take a look at the actual stuff in our monitoring.|
|[0:06:02]||Let's go back to our deployment and we can already see, hey, seems like it has worked. We have now 6 pods running here.|
|[0:06:13]||So, 6 desired and instead of the 5, we are now at 6. Here you can see now, there's actually nothing pending anymore, which is exactly what we wanted. There are no problems anymore here.|
|[0:06:26]||Let's go even a bit further back into our cluster. You can see we have now 4 nodes in our cluster and we now have 13 free pods again.|
|[0:06:38]||So, 13 free pods, so we can actually run some more workloads on our cluster. So, adding the node helped us resolve a couple of problems in our cluster. We can see here, the deployment has no problems anymore, it's good.|
|[0:06:55]||But you can still see a couple of our nodes are having issues. Some of our nodes don't look too healthy. Let's take a look at them to get a view on the node monitoring which comes with Checkmk 2.1.|
|[0:07:11]||Okay, here is the view of the services of the host. And as you can see, there is still a problem here on the pod resources. That's critical because the free pods are still at 0.|
|[0:07:24]||Because what has happened is that obviously new stuff gets scheduled on the new node, but Kubernetes doesn't move the workloads from this node to other nodes. So, this is actually quite unbalanced at the moment.|
|[0:07:37]||This is something, which, if you want, you can fix. You don't have to but it tells you a little bit, okay, about the balance inside your cluster.|
|[0:07:47]||Besides the critical service which we see, we see a lot of other services which tell us a lot about the health of a node.|
|[0:07:55]||Let's take a look at the individual services which we see here for that node.|
|[0:07:59]||First thing is we see a condition and this condition comes from Kubernetes, from the Kubernetes API. Kubernetes thinks this node is healthy because it passes all the conditions. We can also see how many containers are running on this node.|
|[0:08:15]||We can see the CPU load and we can also see something like CPU resources and utilization. You might wonder why are there actually 3 services which concern themselves around the CPU.|
|[0:08:26]||The simple reason for that is there's the Kubernetes view which are the CPU resources and there's the actual node view.|
|[0:08:36]||Because Kubernetes doesn't know anything about the stuff running outside of Kubernetes on the node, and there could be things running on this node. As a best practice, there shouldn't be.|
|[0:08:46]||But there could be things running on the node which consume CPU, and it's good to know about these things, because if there's some performance problem on that node, you might want to know about that.|
|[0:08:58]||That's why we provide some holistic monitoring on your Kubernetes nodes so that you can discover problems which reside outside of Kubernetes as well.|
|[0:09:06]||For example, your file system: is your file system being filled up by some application next to the Kubernetes cluster? I think it's important to know because in this case Kubernetes will not warn you about it but Checkmk can warn you about it.|
|[0:09:21]||You can also see some further information on what kind of actual node is it, what OS is running on it, what container runtime, some context information.|
|[0:09:30]||You can see further information on the Kernel Performance. You can see if the kubelet is actually running, which is very important: if the kubelet is not running, then your node will not be working at all for Kubernetes.|
|[0:09:42]||You can see memory here, it's very similar to CPU. It doesn't only matter what is running inside your Kubernetes cluster, but also outside. And you can see further information around the node.|
|[0:09:58]||As you can see, we have a holistic monitoring for your nodes which is very important because all your workloads in your Kubernetes cluster are running on nodes and being able to alert on that is very important.|
|[0:10:11]||And let's go actually into alerting because we have seen there are a couple of off-the-shelf alerts. We focused on delivering only the alerts that you really need.|
|[0:10:22]||But you might have your own use case, you might want to alert on specific things. And we can do that by configuring our own alert. Let's do that.|
|[0:10:31]||To configure an alert from a service, we can just go to the burger menu here, click on Parameters for this service, and we want to create an alert for CPU resources.|
|[0:10:44]||Now we see there's this rule which is applicable. We just click here. Then we get directly to the rule which is relevant for us.|
|[0:10:53]||I create a rule, a generic rule and I, for example, say hey, I want to be alerted for limits utilization.|
|[0:11:02]||Because there could be the case that, hey, I don't want my pods to run into CPU throttling, something which will happen if your limit utilization reaches 100%.|
|[0:11:15]||Kubernetes will just start throttling your pods and this will lead to performance problems for your application so you might want to have an alert for that. And you want to be alerted there a little bit before it actually happens so we can just use that stuff.|
|[0:11:30]||And now let's define for which objects this alert should be valid. So, we can use labels. Labels are super powerful, and Checkmk 2.1 comes with a lot of predefined labels for Kubernetes. One of them is anything around objects.|
|[0:11:51]||So, for example, I want to have this rule only applicable for pods. And I don't care that it works for all deployments or stuff; I have a specific deployment which is really important for me.|
|[0:12:11]||And the deployment is called, let's take a look, I know what it's called. It's php-apache.|
|[0:12:16]||This is the deployment. I only want this alert to be valid for pods of this deployment. That's it. That's how easy I can configure an alert, for example, for CPU limit utilization.|
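The label conditions described above can be pictured as a simple "all conditions must match" check. This is only an illustrative sketch, not Checkmk's rule engine. The label keys `cmk/kubernetes/object` and `cmk/kubernetes/deployment` are assumptions about how the predefined labels are named, and `rule_applies` and the sample pod labels are hypothetical; only the values `pod` and `php-apache` come from the video.

```python
def rule_applies(host_labels: dict, conditions: dict) -> bool:
    """Return True only if every condition label is present with the same value."""
    return all(host_labels.get(key) == value for key, value in conditions.items())


# Conditions as configured in the video: only pods of the php-apache deployment.
# The label keys are assumed names, not confirmed by the transcript.
conditions = {
    "cmk/kubernetes/object": "pod",
    "cmk/kubernetes/deployment": "php-apache",
}

# Hypothetical label sets for two monitored objects.
php_pod = {"cmk/kubernetes/object": "pod", "cmk/kubernetes/deployment": "php-apache"}
other_pod = {"cmk/kubernetes/object": "pod", "cmk/kubernetes/deployment": "infinite-calls"}

print(rule_applies(php_pod, conditions))    # True: both labels match
print(rule_applies(other_pod, conditions))  # False: different deployment
```

Combining two labels narrows the rule in two steps: first to pods only, then to the pods of one deployment.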
|[0:12:32]||Let's save that and activate the changes. And to actually see this alert in action, let's first go again into our dashboard.|
|[0:12:49]||Let's take a look at the deployment dashboard here for that one. And right now we are pretty far away from an actual limit utilization as we can see here.|
|[0:13:06]||It's pretty far off here. We don't get an alert yet. So, what I will do is I'm gonna increase the workload on this deployment.|
|[0:13:14]||Okay, to actually increase the workload on this deployment, I have a little nice fake application here.|
|[0:13:28]||And let's scale this one up. It's called infinite-calls. And what I will just do now is I will increase the workload by adding some more pods onto it. Let's try 35.|
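The video does not show which tool performs the scaling. One common way is `kubectl scale`, and the sketch below only builds that command line so it can be inspected before running it. It assumes that infinite-calls is a Deployment and that it lives in the `default` namespace; both are assumptions, and `scale_command` is a hypothetical helper name.

```python
import shlex


def scale_command(deployment: str, replicas: int, namespace: str = "default") -> list:
    """Build a `kubectl scale` command line for a Deployment (nothing is executed)."""
    return [
        "kubectl",
        "scale",
        f"deployment/{deployment}",
        f"--replicas={replicas}",
        "--namespace",
        namespace,
    ]


# 35 replicas of infinite-calls, as in the video; the namespace is assumed.
print(shlex.join(scale_command("infinite-calls", 35)))
```

To actually apply the change, the list could be passed to `subprocess.run(..., check=True)` on a machine where `kubectl` is configured for the cluster.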
|[0:13:49]||And let's wait a little bit until all of them are spawned. And then we can hopefully see an alert in our Checkmk monitoring.|
|[0:14:00]||So, we are back now in our deployment dashboard and we can already see there's the first alert here, yeah. So, you can already see 1 pod here has a usage of 0.4.|
|[0:14:11]||And if we take a look at this, we can see the limit utilization is at 85%. We have a limit of 0.5.|
|[0:14:19]||We can actually take a look in detail into the metrics here. And as you can see, the limit is the line up here, the green area is the usage. And we actually have a graph as well for the limit utilization, that's where we now have our monitoring.|
|[0:14:41]||You can see immediately where our thresholds are for all alerts. We have our warning at 80% and our critical at 90%.|
|[0:14:50]||And you can see here, we now have already reached the 80%, so we already know there is quite a lot of workload on that individual pod. We are not at the maximum yet. We are not at 100 yet but we are not too far away from that.|
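The relationship between usage, limit, and the two thresholds can be written down in a few lines. This is a minimal sketch rather than Checkmk's check logic: it assumes that limit utilization is usage divided by limit, and that a threshold counts as reached when the value is equal to or above it. The function names are hypothetical. The usage of 0.425 cores is also an assumption: the video shows a rounded 0.4, and 0.425 is chosen here so that the result matches the 85% on screen.

```python
def limit_utilization(usage_cores: float, limit_cores: float) -> float:
    """CPU usage as a percentage of the configured CPU limit."""
    return usage_cores / limit_cores * 100.0


def state(utilization: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Map a utilization percentage to a monitoring state (thresholds inclusive)."""
    if utilization >= crit:
        return "CRIT"
    if utilization >= warn:
        return "WARN"
    return "OK"


# Limit of 0.5 cores and the 80%/90% thresholds come from the video.
# 0.425 cores is an assumed unrounded usage that yields the 85% shown.
utilization = limit_utilization(usage_cores=0.425, limit_cores=0.5)
print(round(utilization), state(utilization))
```

With these thresholds, the warning fires while there is still headroom before the limit is hit at 100% and throttling starts.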
|[0:15:05]||And for that we will now see a clean alert in our emails, for example, or in our Slack channels, however you have configured that. And that's how easy it is to set up alerts with the new Kubernetes monitoring of Checkmk 2.1.|
|[0:15:18]||I hope you liked it and I hope you have a lot of fun monitoring Kubernetes yourself. Thanks for watching. Please like and subscribe.|