Ep. 54: Monitoring OpenShift clusters with Checkmk

[0:00:00] Hello, welcome to the Checkmk channel. Today we'll show you how you can monitor OpenShift clusters with Checkmk.
[0:00:11] OpenShift is a Kubernetes distribution by Red Hat. With Checkmk you can monitor the health and the performance of the cluster itself, the infrastructure, and the workloads of the OpenShift cluster. Let's take a look at how you can configure that with Checkmk.
[0:00:32] Our journey starts with the official Checkmk user guide, where we will find all steps required to set up OpenShift monitoring. Let's search for the OpenShift article; in this article we will learn specifically about the prerequisites we have to create in the cluster.
[0:00:52] The first of these is creating a namespace. Let's just copy the command and run it in our shell. We've created a namespace “checkmk-monitoring”. The next thing we do is apply this YAML file to our cluster, and before we do that, I always recommend taking a look into it so that you don't install something weird into your system. The YAML file specifies a service account, a cluster role, and a cluster role binding. In the cluster role you can see what access Checkmk will get to your cluster: it will be able to see these resources and details about them. It's a very limited set of rights, so it's quite safe to apply.
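For reference, here is a condensed sketch of these two steps. The namespace name is the one used in the video; the service account and cluster role names and the rule set shown below are only illustrative placeholders, the authoritative manifest is the one linked in the Checkmk user guide.

    # Create the namespace for the Checkmk monitoring objects
    oc create namespace checkmk-monitoring

    # Abridged sketch of what the applied manifest defines
    # (names and rule set are illustrative; use the file from the user guide)
    oc apply -f - <<'EOF'
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: checkmk                       # illustrative name
      namespace: checkmk-monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: checkmk-metrics-collector     # illustrative name
    rules:
      - apiGroups: [""]
        resources: ["nodes", "pods", "namespaces", "services"]
        verbs: ["get", "list", "watch"]   # read-only access, as described above
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: checkmk-metrics-collector
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: checkmk-metrics-collector
    subjects:
      - kind: ServiceAccount
        name: checkmk
        namespace: checkmk-monitoring
    EOF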
[0:01:44] Let's run this, and we can now see that we have created a service account, a cluster role, and a cluster role binding. Now that this part of the requirements is in place, the next thing we need to know is the API endpoint through which we will monitor the cluster. For this we can use the command 'cluster-info', which tells us that the Kubernetes control plane is running at this address.
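If you want to run the lookup in your own shell, it looks roughly like this; the address is of course specific to your cluster.

    # Show where the Kubernetes control plane (API server) is running
    oc cluster-info
    # Example output, the address will differ per cluster:
    #   Kubernetes control plane is running at https://api.<your-cluster-domain>:6443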
[0:02:15] We'll need this later in the configuration of Checkmk. The next thing we need is the Prometheus endpoint, because Checkmk queries not only the Kubernetes API but also the Prometheus API endpoint to get all the data required for monitoring OpenShift. Here we can see it; this is the endpoint we need to configure later on.
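On a default OpenShift installation, the built-in monitoring stack exposes its query endpoint as a route in the openshift-monitoring namespace. One way to look it up from the shell is sketched below; the route name (thanos-querier vs. prometheus-k8s) can differ depending on your setup and OpenShift version.

    # List the routes of the built-in monitoring stack
    oc get routes -n openshift-monitoring
    # Read the hostname of the query endpoint directly, assuming the
    # default thanos-querier route exists
    oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}'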
[0:02:49] Lastly, we need to get a token to ensure that we have secure communication and that we are actually allowed to communicate with the API. You can run this command to get the token; save it for later as well. Finally, we can get the certificate. In this case I will skip that step, but in production environments I recommend retrieving the certificate and applying it to ensure secure communication with the cluster. As the next step we have to set up our Checkmk monitoring. You can also follow the guidelines here, but as we don't have to copy any more commands, I will move directly into Checkmk. We are now in Checkmk, and the first thing we do is save the token in our password store, so that we can reuse it later on in a safe fashion.
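For reference, the token we copied on the command line can be generated roughly like this, assuming the service account from the sketch above is named checkmk and your oc/kubectl version supports 'create token'; for long-lived tokens the Checkmk user guide describes creating a dedicated service-account-token Secret instead.

    # Issue a time-limited API token for the monitoring service account
    # (name "checkmk" is an assumption; the maximum duration may be capped by the API server)
    oc create token checkmk -n checkmk-monitoring --duration=8760h

    # Optional: read the cluster CA certificate for verified TLS from the
    # kube-root-ca.crt ConfigMap that exists in every namespace
    oc get configmap kube-root-ca.crt -n checkmk-monitoring -o jsonpath='{.data.ca\.crt}'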
[0:03:53] We can add a password; I call it my “openshift_cluster”, and I go back to the command line and copy the token over. Now it's saved in our password store. The next thing we do is create a host. This is the host that will query the OpenShift cluster and where all the data from it will be stored. Let's call it “openshift-cluster”. We don't have to assign an IP address to it; we can just leave it like that, save it and view the folder.
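If you prefer to script this step, the host can also be created via the Checkmk REST API instead of the GUI; a minimal sketch, assuming a Checkmk 2.x site and an automation user, with server, site name and secret as placeholders.

    # Create the "openshift-cluster" host via the REST API
    curl -X POST "https://<checkmk-server>/<site>/check_mk/api/1.0/domain-types/host_config/collections/all" \
      -H "Authorization: Bearer automation <automation-secret>" \
      -H "Content-Type: application/json" \
      -d '{"host_name": "openshift-cluster", "folder": "/", "attributes": {"tag_address_family": "no-ip"}}'
    # "tag_address_family": "no-ip" marks the host as having no IP address,
    # matching what we configured in the GUI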
[0:04:44] As the next piece, we configure the connection. For that we go into 'Setup', to “VM, Cloud, Container” and then to Kubernetes. We add a rule; I call this "OpenShift". I use the comment field to note that I created it as “cmkadmin”, and in the next step I specify the cluster name, in this case my OpenShift test cluster “openshift-test”. I use the token from the password store, and now I have to enter the API server connection. For that I go back to my console; here you can see what I copied before, so I copy it over and can paste it here.
[0:05:34] Next I choose to enrich the usage data with data from OpenShift, so you have to select this option in the drop-down box, and then we can specify the Prometheus API endpoint. This is the one we looked up earlier, so I also copy it over and enter it here. I leave the standard options as they are. You can also decide to monitor only specific namespaces, and you can adjust the cluster resource aggregation, which basically determines which data is included in the available resources of your cluster. Typically you don't have to do that. You can also decide to import annotations, so that you can use them later on in host labels and Checkmk rules; I won't do that in this case either, but it's a powerful option. Lastly, we have to assign this rule to the “openshift-cluster” host, so that Checkmk knows this host will query the data from the OpenShift environment. As a last step we configure the Dynamic host management.
[0:06:45] We add a new connection and call it “openshift”. We use the connector type 'piggyback data' and restrict the source hosts to the “openshift-cluster” to ensure that only that host is being used. Then we use the standard sync interval of 1 minute and leave the standard options. We also decide to delete hosts without piggyback data; that means if a pod or an application vanishes from OpenShift, it also vanishes from the monitoring. With ephemeral workloads like those on OpenShift this is a useful option, because you don't want vanished applications to still be monitored.
[0:07:37] We save it, and now we just have to activate the changes. With that, we can now take a look at our OpenShift host. Let's rerun it to see if the connection works. We can see that we were able to connect to it and that Checkmk discovered a lot of services on it, so we can now run the service discovery via “Run service discovery”.
[0:08:23] Checkmk immediately discovers the basic information on the cluster; we can see it under “Undecided services”: whether we were able to successfully query the data from Prometheus, the CPU resources on the cluster, how the requests are utilized, whether the API is available and ready, how the memory is being consumed on the cluster, how many nodes we have (we can see it's a very small cluster), and how many pods are running. Let's accept these changes, activate the changes again, and with that we have set up our OpenShift monitoring.
[0:09:05] Checkmk provides out-of-the-box dashboards for Kubernetes, which you'll find in the 'Monitor' menu under 'Applications' and then 'Kubernetes'. You can see our “openshift-test” cluster here. We'll have to wait a little bit until the data is populated, typically one or two minutes; then we'll have sufficient data so that every "Dashlet" is displayed. Our OpenShift monitoring has started to gather data. We can see the OpenShift test cluster which we just configured, the CPU and memory resources being consumed on the cluster, how many pods are running, and how many worker nodes there are. We also have some more meta information, like the version of our OpenShift cluster. If you configure more clusters, they will all show up here, and you will also see more detailed information in the graphs on the right side.
[0:10:08] Let's take a look at the individual cluster. You can go into a more detailed dashboard by clicking on the name of the cluster. Let's zoom out a little bit so that we can see everything. We are able to see all relevant metrics directly in one view.
[0:10:28] We can see the “DaemonSets”, the “StatefulSets”, and the deployments running on the system. We can see, for example, which ones consume the least memory and which consume the most. We can see how the cluster is utilized. The metrics are being gathered continuously, and the longer the monitoring runs, the more information you will of course have. We can also see cluster problems: here you can see that there is a problem with a “Container Installer” application running in a pod of the “Kube-controller-manager”.
[0:11:11] We can also see that a deployment of the “openshift-console” only has one out of two pods ready, and that a pod of exactly this deployment has run into an error; we can look into further details once we move further down in our dashboards. Moving further down is quite easy. You can, for example, say: let's take a look at the “apiserver” deployment, and then you will get a deployment dashboard which provides details about all the pods running in this deployment. You can see the usage and the memory usage, and if you want, you can dive further into a specific pod and see details on it: for example its condition, its containers (in this case two containers are running in it), which images are being used, and some meta information such as the "Phase" they are running in, how many restarts they had in the last hour and in total, and how long the pod has been up.
[0:12:23] That's OpenShift monitoring with Checkmk. If you want to dive deeper into it, just go to the “Kubernetes” menu entry under 'Applications', and you will have an overview of all your OpenShift clusters.
[0:12:42] That was OpenShift monitoring with Checkmk. With just a few steps we were able to configure the monitoring and get full visibility into the entire cluster, from the top level down to the very detailed level of the pods. Checkmk does all this without putting a lot of load on your OpenShift cluster: we don't install any workloads on it, we gather the data via the Kubernetes API and the Prometheus endpoint, and with that we provide a very secure way to monitor OpenShift clusters.
[0:13:12] Thank you very much for watching, and see you soon again.
