Ep. 5: Using metrics and graphs in Checkmk

Note: All the videos on our website offered in the German language have English subtitles and transcripts, as given below.

[0:00:00]	And today we are going to ...
[0:00:15]	Today’s episode is about measurements and graphing.
[0:00:18]	When Checkmk monitors a service, it not only determines whether it is OK, WARN, or CRIT -- it also records values such as CPU usage, or for a switch port the number of packets or bytes transmitted.
[0:00:32]	These values are very interesting – on the one hand for fault diagnosis, but of course you can also use them for planning.
[0:00:37]	I’ll now show you what that looks like in the interface.
[0:00:41]	So that we have interesting and realistic data, I am again using our in-house monitoring at tribe29. Now I’m just looking for a computer.
[0:00:52]	Here is a host, and here you can see the services on the host. With some of the services you can see a small icon on the right, which indicates that there are measurements for this service.
[0:01:03]	Now I’m going to select the service interface ‘eth0’.
[0:01:06]	This is a Linux host, and this is the network card with which it is connected to the LAN.
[0:01:12]	When I click on the symbol, I get an overview of all its measured values and their progression over time in a larger view.
[0:01:22]	Checkmk stores every measured value for up to four years.
[0:01:25]	Now it is easy to calculate that if you generate one value per minute, and continue to do that for four years, you will accumulate large amounts of data.
[0:01:31]	So that this data still fits on normal hard drives, it is compressed over time.
[0:01:37]	You get one reading per minute for two days. Then one reading every five minutes for ten days. One reading every half hour for 90 days, and in the very long range of up to four years, a reading every six hours.
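The retention scheme just described can be checked with a little arithmetic. The following is only a minimal sketch: the names `TIERS`, `rows_per_tier`, `total_rows` and `uncompressed_rows` are made up for this illustration, and treating four years as 4 × 365 days is an assumption.

```python
# Retention tiers as described in the episode: (step in minutes, retention in days).
# Assumption for this sketch: "four years" is taken as 4 * 365 days.
TIERS = [
    (1, 2),          # one reading per minute for two days
    (5, 10),         # one reading every five minutes for ten days
    (30, 90),        # one reading every half hour for 90 days
    (360, 4 * 365),  # one reading every six hours for up to four years
]

MINUTES_PER_DAY = 24 * 60


def rows_per_tier(step_minutes: int, days: int) -> int:
    """Number of readings kept in one tier."""
    return days * MINUTES_PER_DAY // step_minutes


def total_rows(tiers=TIERS) -> int:
    """Readings kept across all tiers of the compressed scheme."""
    return sum(rows_per_tier(step, days) for step, days in tiers)


def uncompressed_rows(days: int = 4 * 365) -> int:
    """Readings needed to keep one value per minute for the whole period."""
    return days * MINUTES_PER_DAY


if __name__ == "__main__":
    for step, days in TIERS:
        print(f"{step:>3} min step, {days:>4} days: {rows_per_tier(step, days)} readings")
    print("with compression:   ", total_rows())
    print("without compression:", uncompressed_rows())
```

Under these assumptions the tiers hold 2,880 + 2,880 + 4,320 + 5,840 = 15,920 readings per metric, compared with 2,102,400 readings if one value per minute were kept for the full four years.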
[0:01:52]	Each time the measured values are summarized, the minimum, the maximum and the average are saved. This means that after four years you still have access to these three different aggregation functions.
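The summarizing step can be pictured roughly as follows. This is a hedged sketch of the general idea, not Checkmk's actual storage code: the names `Consolidated` and `consolidate` are hypothetical, and a trailing partial group is simply summarized as it is.

```python
from statistics import mean
from typing import List, NamedTuple


class Consolidated(NamedTuple):
    minimum: float
    maximum: float
    average: float


def consolidate(values: List[float], factor: int) -> List[Consolidated]:
    """Summarize each run of `factor` consecutive readings into min/max/average."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    result = []
    for start in range(0, len(values), factor):
        # The last chunk may be shorter than `factor`; it is summarized as is.
        chunk = values[start:start + factor]
        result.append(Consolidated(min(chunk), max(chunk), mean(chunk)))
    return result


# Ten one-minute readings become two five-minute summaries.
per_minute = [10.0, 20.0, 90.0, 20.0, 10.0, 30.0, 30.0, 30.0, 40.0, 20.0]
five_minute = consolidate(per_minute, 5)
```

In this example both five-minute summaries have the same average of 30.0, but only the first one has a maximum of 90.0. That is why keeping the maximum alongside the average is useful: a short peak stays visible even after compression.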
[0:02:07]	Here you can see two measurements over time: green for the input bandwidth, and blue for the output bandwidth of the network interface, eth0, on this host, ci.lan.tribe29.com.
[0:02:22]	Bandwidth here means the number of bytes transferred per second, although here it is shown in bits. So in this example up to 100 megabits.
[0:02:31]	If you move the mouse over the graph, this small window shows the measured values at that point in time.
[0:02:38]	You can also zoom in with the mouse wheel, or with two fingers on the touchpad, which makes the selection a little easier.
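The relationship between bytes and bits per second is a simple factor of eight. As a small sketch, assuming decimal units (1 megabit = 1,000,000 bits) and a hypothetical helper name:

```python
def bytes_per_second_to_megabits(bytes_per_second: float) -> float:
    """Convert a byte rate to megabits per second (assumes 1 Mbit = 1,000,000 bits)."""
    return bytes_per_second * 8 / 1_000_000


# 12.5 million bytes per second correspond to the 100 megabits mentioned above.
rate = bytes_per_second_to_megabits(12_500_000)
```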
[0:02:45]	On the right side you will find different time periods, each with a preview.
[0:02:48]	For example, here you can see the measured value for the last eight days.
[0:02:56]	Another option you have is to move it horizontally.
[0:02:59]	To do this, I simply click on the graph with the left mouse button, and drag it left and right.
[0:03:03]	So you can move horizontally. If you go up and down, you can also zoom in vertically.
[0:03:10]	The graph is usually scaled so that the largest measurement always fits exactly into the scale. But maybe you want to examine a small peak, or a section where the value is relatively low; then you can simply zoom in vertically and enlarge it as needed.
[0:03:27]	The whole graph can also be enlarged or reduced with this corner here.
[0:03:30]	If I want to make it smaller here, I just pull it over to there – or I can also make it bigger.
[0:03:35]	And of course, practical as always: folding in the sidebar gives a bit more space.
[0:03:41]	Many services have not just one metric, but several. So next...
[0:03:45]	I now scroll down here.
[0:03:46]	You can see this, for example, with network interfaces or switch ports.
[0:03:51]	The next measurement is the number of packets that are being transmitted.
[0:03:55]	You can also clearly see that when I compare them, the amount of data is much larger for input than for output, but the number of packets is relatively similar.
[0:04:06]	Then there is an interesting graph for the errors, which luckily is empty right now, as it should be.
[0:04:12]	But I see here in the preview, for example, that in the past 35 days there was a point where there were errors.
[0:04:19]	I can look at them here, and see that there was actually a job with transmission errors here.
[0:04:27]	I can’t judge whether that was bad or not. You could simply investigate that now. You can do a whole lot more with the graphs.
[0:04:35]	There are, for example, so-called graph collections, where you can gather individual graphs in a collection and then compare them.
[0:04:42]	You can embed graphs in dashboards. You can include graphs in reports.
[0:04:47]	You can also create completely custom graphs by combining very specific metrics from particular hosts and services in a single graph.
[0:04:55]	And recently a connector for Grafana has been added, so that Grafana can directly access Checkmk’s metric data and display the graphs in Grafana dashboards.
[0:05:06]	And as always all of the details can be found in the manual.
[0:05:10]	That was it for today.
[0:05:11]	I hope this was useful for you, and we’ll see each other in the next episode.