Ep. 14: Distributed monitoring with Checkmk

Note: All videos on our website that are offered in German have English subtitles and transcripts, as given below.

[0:00:00] Distributed monitoring, what exactly is behind this – you will see if you stay tuned...
[0:00:15] Distributed monitoring involves combining multiple Checkmk servers into one large monitoring system.
[0:00:23] There can be a number of reasons why you might want to do this.
[0:00:26] One would be, for example, scaling: with only one Checkmk server I can't monitor as many hosts as I currently have.
[0:00:32] Checkmk is really very performant, and it scales well, but at some point it reaches the limit of what one physical server can do, and then you can simply combine two, three, or four servers – as many as you need – and connect these to form one large system.
[0:00:49] Another reason for distributed monitoring could be organisational: different servers are administered by different organisations, but you still want to have an overall view of the complete monitoring.
[0:01:01] I have a similar example when I address the subject of networks. It may be that you have a location in Munich and one in, say, Singapore, and you just don't want all of the monitoring data to travel from Munich to Singapore and back – that would mean that for every single host I want to query, data always has to travel halfway around the world. Instead I have my own local Checkmk on site, which I then reconnect in such a way that this all becomes one big system.
[0:01:30] The issue of availability is also somewhat similar.
[0:01:33] Perhaps you also want the monitoring in a remote location to be able to continue even if the WAN connection is not available. Or it may be that, for security reasons, monitoring from location A into location B or into network segment B is not possible at all – for example a DMZ, a secure area where monitoring access from outside is not desired. It is possible to create a separate site for this secure area, and then reintegrate that site into the large monitoring system.
[0:02:03] And how that works – that is, that I assemble a large monitoring system from several Checkmk servers – I will now show you.
[0:02:12] The starting point is a quite normal monitoring system, for example the one that we used in the previous episode for explaining maintenance times. I have 5 hosts here, everything is normal.
[0:02:25] The next step is for me to create a remote system, that is, a second site that I want to integrate. For this purpose I now switch to the command line of a Checkmk server. This does not have to be the same server – it can be a different one; in my example, for simplicity's sake, it is the same server, but that doesn't matter at all. On it I now create a site, which I will call 'remote1'. Everything is just as you know it, everything as usual.
[0:02:58] I am also changing the password. To do this I first go into the site as the site user and issue the command from there, typing in a password that I can easily remember.
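
For orientation, here is a minimal sketch of these two steps on the shell. The site name 'remote1' is from the video; the exact password command depends on your Checkmk version:

    # As root: create the new site on the server that will host the remote site
    omd create remote1

    # Change into the site context as the site user
    su - remote1

    # Set a new password for the 'cmkadmin' web user (older versions use
    # htpasswd as shown; newer versions provide 'cmk-passwd cmkadmin' instead)
    htpasswd -m ~/etc/htpasswd cmkadmin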
[0:03:13] So, now the first important step is to allow access from a remote location. The way this works is that the master site accesses the remote site via a specific TCP port in order to retrieve the status data from there.
[0:03:31] For this purpose I call 'omd config', then go to the 'Distributed Monitoring' option, and there I activate 'LIVESTATUS_TCP'. This is the crucial point.
[0:03:46] Here I could and should restrict from which IP addresses this site is reachable. I can also switch on encryption, which is already switched on here by default.
[0:04:00] I will leave this address restriction out for now, but normally it is recommended to enter your master system's address here.
[0:04:09] So, I then return to the main menu, exit from it, and now I start this site. And with that I'm finished at this point.
[0:04:22] The remote instance is now ready to be included in the monitoring.
[0:04:29] Of course it is important that you remember this port number – by default TCP port 6557 is suggested, and I have not changed that here. If you want to run several sites on one server, you will of course have to assign each of them a different TCP port number.
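
The same settings can also be made non-interactively. A minimal sketch, assuming the site is called 'remote1' and is stopped while you change the configuration (the restriction address 10.1.1.1 is a made-up example for the master server):

    # As root: enable Livestatus access via TCP for the site
    omd config remote1 set LIVESTATUS_TCP on

    # Optional: make the default port 6557 explicit and restrict access
    # to the master's address (10.1.1.1 is a placeholder)
    omd config remote1 set LIVESTATUS_TCP_PORT 6557
    omd config remote1 set LIVESTATUS_TCP_ONLY_FROM 10.1.1.1

    # Start the remote site
    omd start remote1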
[0:04:43] I can now perform all further steps directly in the master's interface.
[0:04:48] To do this, in WATO you go to the 'Distributed Monitoring' module, where I need to create a new connection to the remote site.
[0:05:00] My own site already appears here – this entry was created automatically, because you always have at least one site in a monitoring system.
[0:05:09] So I go to 'New connection'. The important thing here is that 'Site ID' is exactly the same name that you used when you performed the 'omd create' on the remote site.
[0:05:21] In my example 'remote1'. Now I am entering 'Singapore' here as a comment, for example.
[0:05:27] Also important are the IP address of the server on which the remote site is running and its port number – the one that was just configured in 'omd config'. The port number is '6557', which means that if there is only one remotely reachable site on this server, you can stick with this default.
[0:05:49] Encryption – of course we want to encrypt here, because security is very important to me. For our test, however, we must untick the verification checkbox, because we have not stored a certificate generated with our own CA.
[0:06:06] This means that the certificate used by the remote site is an automatically generated, random one, and therefore we unfortunately cannot verify it.
[0:06:15] That is another topic – you can look up in the user manual exactly how to do this if you want to store your own certificate.
[0:06:23] The Livestatus Proxy is a feature of the Enterprise Edition that optimizes waiting times on very high-latency connections, for example connections over the Pacific, and it also detects when remote sites are unavailable.
[0:06:40] What you certainly have to enter here is the URL prefix through which the web interface of the remote site can be accessed.
[0:06:53] Now I need to enter the name of the site here, and then I come to the last box, which concerns the Configuration Connection. Here you have two options. You can say that each site is administered locally – the colleagues in Singapore use their own Checkmk to set up their hosts themselves, and in Munich you only want a central status screen from which you run the operation or keep an overview. The other variant, which is actually more common nowadays, is that you also want a central configuration, so that the whole monitoring system feels like one big system: nothing is actually touched in Singapore, and everything is handled from Munich.
[0:07:38] To do this, you select 'Push configuration to this site'. The central system – in this case the Munich system – then pushes its configuration to all remote systems, while the status information of the monitored hosts flows back to the central system. For this we need the same URL as above, extended by the path component of Checkmk, namely '/check_mk/'.
[0:08:06] The rest can be left as it is, so simply save.
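For the curious: WATO stores what we just entered in the site configuration of the central instance. A rough sketch of what such an entry can look like – file path and keys as in Checkmk 1.5, the exact structure varies by version, and the address 10.3.0.20 is made up for illustration:

    # ~/etc/check_mk/multisite.d/sites.mk on the central site (written by WATO;
    # do not rely on the exact keys, they differ between versions)
    sites.update({
        'remote1': {
            'alias': u'Singapore',
            'socket': 'tcp:10.3.0.20:6557',                        # Livestatus status connection
            'url_prefix': 'http://10.3.0.20/remote1/',             # URL prefix of the remote GUI
            'replication': 'slave',                                # configuration is pushed to this site
            'multisiteurl': 'http://10.3.0.20/remote1/check_mk/',  # URL for the configuration push
        },
    })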
[0:08:12] The next step is to log in to the remote site, so to speak.
[0:08:17] So far we have only created a connection, but what we now need is administrator access to this remote site, which you can set up by clicking here on this key icon.
[0:08:29] At the moment we are not yet logged in, so we can't distribute any configurations. If I click here on the key, I can log in for a one-time session as 'cmkadmin' on the remote site. You have to confirm once more that any configuration that might already be there will be overwritten. In our case the site is empty, since we have only just created it. But if someone had already been working there for a long time, possibly monitoring hundreds or thousands of hosts, that configuration would be completely overwritten – integrating an existing remote site is therefore not automatically possible.
[0:09:03] If you want to take over the hosts that have been integrated there, this would have to be achieved in a different way.
[0:09:10] I now overwrite the configuration, press 'Login', and now I have established a connection.
[0:09:17] So I have the green tick here on the right. This is the tick for the configuration connection, i.e. for pushing the configuration, and the green tick on the left is for the status connection, which I need in order to see the current monitoring states of the hosts and services at the remote site.
[0:09:31] So, to try it all out, we now simply want to add a new host to the monitoring – a host that will be monitored in Singapore.
[0:09:39] This means that we are still logged on to the Munich site and want to create a host that will actually be created in Singapore. And this is actually not at all difficult.
[0:09:51] I open my sidebar again, then go to the host administration, and there I simply create a new host.
[0:10:04] I now name it, for example, 'cmksingapur', and would like to start by monitoring the actual Checkmk server in Singapore.
[0:10:14] Now comes the crucial point – there is an attribute 'Monitored on site', with which you specify the site from which this system is to be monitored. I select 'remote1', which means that the server on site will make the request to the agent, and that agent can perhaps reach network areas other than those accessible from my central site.
[0:10:40] The rest is the same as before – I enter the IP address, but of course this time the address under which the system is reachable directly from Singapore, and that is all I really need to do.
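
Incidentally, creating such a host can also be automated via Checkmk's web API. A hedged sketch – the server name, automation user, secret, and IP address are placeholders, and this is the webapi.py interface of Checkmk 1.5/1.6 (newer versions use the REST API instead):

    # Create a host that is monitored from the 'remote1' site
    curl "http://monitoring.example.com/mysite/check_mk/webapi.py?action=add_host&_username=automation&_secret=MYSECRET" \
        -d 'request={"hostname": "cmksingapur", "folder": "",
                     "attributes": {"site": "remote1", "ipaddress": "10.3.0.20"}}'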
[0:10:57] Of course, as always, the agent must be installed on the system. Then I can go back to my service configuration, and the service discovery is now already running via the site in Singapore.
[0:11:14] This means that the agent there is contacted and the services to be monitored can be identified. It all looks exactly as if it were local, and this is intentional, because the whole system is supposed to feel like a single large monitoring system. So I simply set all services to be monitored and click 'Activate changes'. And here we see a second change: instead of just one line as before, we now see our two sites here, on two lines. From now on, whenever you activate changes, you will always see which remote instances have changes that need to be activated. We have added a host located in Singapore, so the site in Singapore is now marked as needing to activate its configuration, while the local instance is marked as up-to-date.
[0:12:09] We have not changed anything there either.
[0:12:10] So if I now do an 'Activate affected' here, all sites that have pending changes will be activated automatically. I just press it, you can see the progress bar moving at the remote instance, and then all instances are again marked with the green check mark – the change has been activated.
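
The discovery and activation steps also have command-line equivalents, run as the site user of the site that monitors the host (shown only for illustration – in a distributed setup with central configuration you would normally stick to the GUI):

    # Run service discovery for the new host
    cmk -I cmksingapur

    # Activate the configuration by reloading the monitoring core
    # ('cmk -R' restarts the core instead of reloading it)
    cmk -O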
[0:12:31] If you now go into the list of hosts, you will see that a new host has appeared here in Singapore – we can now see a total of 6 hosts.
[0:12:42] That's five on the 'mysite' site and one in Singapore. And so we have constructed a distributed monitoring system that feels like one large monitoring system, but which has the advantage of scaling over many, many sites; it is fail-safe; you can monitor into secure network areas without having to open up a firewall for monitoring; and it has many other advantages as well. You also see a status for each site: the newest one now appears below in the Site status snap-in. I have a local site named 'mysite', and I have the site in Singapore. With this snap-in you can do two things: first, you can see whether a site is currently reachable – which I hope is the case – and second, you can hide it.
[0:13:33] For example, if I now go to the view of my hosts and hide Singapore, then I will only see the hosts from the local site – or I can do it the other way around and hide the local hosts, and then I will only see the hosts in Singapore.
[0:13:49] It is important to know what actually happens when an instance is unreachable. The basic principle behind Checkmk's distributed monitoring is that all monitoring data remains on the remote instance and is not continuously transferred to the central system.
[0:14:06] This is an important principle, because only this allows maximum scaling. If you imagine that you could have 100 remote sites monitoring a million hosts, it would make little sense to write all monitoring data continuously into a central database.
[0:14:19] No database in the world would be able to accommodate so much data in such a short time.
[0:14:25] That is why Checkmk ensures that the monitoring data always remains on the remote sites. This has the advantage that it scales very, very well, and the further advantage that a remote site can function independently even if the central site is not reachable. It also means that if a remote instance is currently unreachable, you will not see outdated data, but rather no data at all – it is simply not visible.
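
Technically, the central site asks the remote site for its current status via Livestatus whenever a view is rendered; nothing is copied into a central database. As a minimal sketch, this is what such a query looks like when sent to a site's local Livestatus socket (the socket path is the OMD default; inside a site the 'lq' shortcut does the same):

    # Ask Livestatus for the current state of all hosts (0=UP, 1=DOWN, 2=UNREACHABLE)
    echo -e 'GET hosts\nColumns: name state\n' | unixcat ~/tmp/run/live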
[0:14:52] You will see an error in the Site status, and a red error bar will appear in the status views to indicate that the data is incomplete.
[0:15:00] When it comes to monitoring, incomplete data is just as useless as outdated data: if the remote site is completely missing, it doesn't do much good to know what the status was like an hour ago – I am always interested in the current status. You have to be aware of this with Checkmk's distributed monitoring. The advantage is that as soon as the remote site is reconnected, you immediately receive the current status. There is no backlog of monitoring updates that would have to be pushed to the central site afterwards; you get up-to-date data from the remote site without any time delay.
[0:15:38] And so, for the time being, we are finished with distributed monitoring.
[0:15:42] There are of course also many other interesting aspects, such as what happens with alerts, what happens with the Event Console, and so on. And of course you can find all of these details in a very detailed article on distributed monitoring in our user manual.
[0:15:55] And I advise you to have a look there even if you use the Raw Edition, because in that case you should configure a so-called status host for the remote sites, so that in case of a failure the user interface does not wait forever for data, but recognizes the site as faulty and continues to operate normally.
[0:16:13] So, thanks for participating, I hope it was interesting for you, and hopefully we will see you for the next video.
