Ep. 14: Distributed monitoring with Checkmk

[0:00:00]	Welcome back to the Checkmk channel. In this episode, we're taking a look at distributed monitoring.
[0:00:14]	Distributed monitoring means that you use multiple Checkmk servers to create one large monitoring system and there are multiple reasons to create such a setup.
[0:00:24]	One could be scaling, for example when you have more hosts than you can monitor on a single server. Checkmk's performance lets it scale very well, but with a very large number of hosts you might get to the point where you need a second Checkmk server. At that point, you can create 2, 3, 4 or as many Checkmk servers as you need and connect them all together to create one large monitoring system.
[0:00:55]	Another reason could be organizational: separate Checkmk servers are managed by different parts of a company, but in the end you still want one complete overview of the entire monitoring of the company.
[0:01:07]	Then there is networking. Let's say that you have a location in Munich and one in New York. If you work from New York but your monitoring system is located in Munich, then every time you query a host located in New York, data would ping-pong over the Atlantic Ocean four times.
[0:01:29]	Instead, you could set up a local server in New York, which would monitor all the hosts in New York and then connect them together using distributed monitoring to monitor both locations.
[0:01:42]	This would also cover the issue of availability. You want your monitoring system to keep working even when there is no connection available to a remote site.
[0:01:54]	General security: it might be the case that you cannot monitor location A from location B for security reasons or that there is a DMZ, a secure area where monitoring access is not desirable.
[0:02:09]	In that case, you could create a Checkmk site for that secure area and integrate that into your larger monitoring system. And how this all works, and how you can create a distributed setup using multiple Checkmk servers is what I will show you.
[0:02:25]	Our starting point will be this normal Checkmk site with 6 hosts that we have used throughout the series.
[0:02:33]	Now the next step will be to create a remote site, so a second site that we then want to integrate into this one. To do this I will switch to the command line of our Checkmk server.
[0:02:46]	Now normally this would be your remote server, but in this case it's the same server as our main one. This is just an example, and it will work just fine.
[0:02:57]	So now let's create the site with the 'omd create' command. So 'omd create', and let's stick with the example of New York and call this one 'checkmk_new_york'. Okay, and now that this is done we can change the password to something that is easy to remember.
[0:03:28]	So we log in using 'su - checkmk_new_york', and now we can copy this command: select it and paste it with the middle mouse button. Then set your password.
[0:03:59]	Now the first important step is to allow access from a remote location. The way this works is that the main site connects on a specific TCP port, in order to fetch the status data from the remote site. To do this we use the command 'omd config'.
[0:04:20]	Then navigate to Distributed Monitoring. And here we activate LIVESTATUS_TCP.
[0:04:33]	Now we should also configure an IP address here, and this is the IP address from which this site can be reached. You can also set up encryption here, but this is already enabled by default. I will leave the IP restriction as it is for now, but normally it's recommended to set this to the IP address of your main Checkmk server. Now we can go back to the main menu. And exit.
[0:05:07]	And the only thing left to do is start the site, so 'omd start'. We are now at the point that this remote site is ready to be included in our monitoring.
[0:05:25]	It's important that you write down or remember the port on which you will be connecting. The default is set to 6557 and I have not changed it, but if you are going to connect multiple Checkmk sites to one main site, then of course you're going to have to configure multiple TCP ports.
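As a rough illustration of the port planning mentioned here, the following Python sketch hands out one Livestatus TCP port per remote site, starting at the default 6557. It assumes that the remote sites share one server and therefore cannot reuse the same port. The helper name and the second site ID are invented for this example, and the code does not talk to Checkmk at all.

```python
DEFAULT_LIVESTATUS_PORT = 6557  # default port mentioned in the video


def assign_livestatus_ports(site_ids, first_port=DEFAULT_LIVESTATUS_PORT):
    """Map each site ID to its own TCP port, counting up from first_port.

    Hypothetical planning helper; it only builds a dictionary.
    """
    ports = {}
    for offset, site_id in enumerate(site_ids):
        if site_id in ports:
            raise ValueError(f"duplicate site ID: {site_id}")
        ports[site_id] = first_port + offset
    return ports


if __name__ == "__main__":
    # "checkmk_boston" is an invented second site, used only for illustration.
    plan = assign_livestatus_ports(["checkmk_new_york", "checkmk_boston"])
    for site_id, port in plan.items():
        print(f"{site_id}: LIVESTATUS_TCP port {port}")
```

The resulting numbers would then be entered by hand, once per site in 'omd config' and once in the matching connection on the main site.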
[0:05:45]	All further steps can be done directly in the interface of our main site. To do this we go to the Setup menu and search for distributed monitoring. Here we can add a new connection to our remote site. You already see our main site here; this has been added automatically. To add our remote site, click on 'Add connection'.
[0:06:10]	For our site ID, we need to enter the name of the remote site exactly how we typed it while running the 'omd create' command. So in our case checkmk_new_york. And for the alias we'll use New York.
[0:06:34]	Important is the IP address of the server on which your remote site is running, and also the TCP port which you saw in the menu when we ran the 'omd config' command.
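Before entering the IP address and port in the connection form, it can help to confirm that the port is reachable from the main server at all. The Python sketch below does only that: it opens and closes a plain TCP connection. It does not speak Livestatus or TLS, the function name is made up, and 192.0.2.10 is a documentation-only placeholder address.

```python
import socket


def livestatus_port_reachable(host, port=6557, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This checks reachability only; no Livestatus query is sent.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Replace the placeholder with the IP address of the server that
    # runs the remote site.
    print(livestatus_port_reachable("192.0.2.10", 6557))
```

A result of False can mean that the site is not started, that LIVESTATUS_TCP is not enabled, or that something on the network blocks the port.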
[0:06:54]	And to keep everything nice and secure we of course want to use encryption, but in our case we are going to have to uncheck this box here.
[0:07:05]	We have to do this because the certificate on our remote site is an automatically generated, random one that was signed by the site's own CA, and therefore we unfortunately cannot verify it.
[0:07:16]	If this is something you want to set up for your own monitoring system then you can find all the information in our documentation. The Livestatus proxy is a feature of the Enterprise Edition that optimizes waiting time for very high latency connections,
[0:07:33]	for example over the Atlantic Ocean, and it also detects when remote sites are not available. I will also have to configure the URL prefix here, under which the user interface of the remote site will be available. In our case that will be the IP address followed by the name of the site.
[0:08:08]	Next we need to go to this last section, 'Configuration connection'. Here we have two options. You can either say that the site is managed from its own location, so admins in New York manage their own Checkmk and set up hosts themselves.
[0:08:25]	And then in Munich you just have one central status overview from where you can run operations and have all the insights.
[0:08:32]	The other variant, which is more common nowadays, is that you have a central configuration, so that your monitoring feels like one big system. Then in New York nothing is really being touched, but everything is handled from within the Munich site.
[0:08:50]	To set that up you have to choose 'Push configuration to this site' here. This means that the configuration from your main site is being pushed to this remote site. The next thing we have to do is to set up the URL of the remote site.
[0:09:05]	This is the same as we had before, so we can copy that. The only thing we need to add is "check_mk/". The rest we can leave as it is and simply press Save.
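The two URLs entered in the connection form follow a simple pattern, which can be sketched in Python as shown below. The "http" scheme and the placeholder address 192.0.2.10 are assumptions for illustration, and the function name is invented. Only the "IP address followed by the site name" prefix and the "check_mk/" suffix come from the video.

```python
def remote_site_urls(host, site_id, scheme="http"):
    """Return (url_prefix, multisite_url) for a remote site.

    url_prefix    -> <scheme>://<host>/<site_id>/
    multisite_url -> url_prefix followed by "check_mk/"
    """
    url_prefix = f"{scheme}://{host}/{site_id}/"
    multisite_url = url_prefix + "check_mk/"
    return url_prefix, multisite_url


if __name__ == "__main__":
    prefix, multisite = remote_site_urls("192.0.2.10", "checkmk_new_york")
    print(prefix)     # http://192.0.2.10/checkmk_new_york/
    print(multisite)  # http://192.0.2.10/checkmk_new_york/check_mk/
```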
[0:09:25]	Next thing we need to do is log in to the remote site.
[0:09:29]	Until now we have only set up a connection, but we need administrator access to this remote site so that we can actually push our configuration to it. To do this we need to set the credentials for a one-time session, and you can do this when you click on this key icon here.
[0:09:23]	The credentials will not be stored but only used this one time to perform an initial handshake, and we'll use the cmkadmin user for this. We also need to confirm that we want to overwrite the configuration of the remote site with that of our main site.
[0:10:12]	In our case our remote site is empty because we have just created it, but if you had already worked in your remote site for a long time and had hundreds of hosts in there, then those would be overwritten.
[0:10:24]	So integrating a site with existing hosts this way is not automatically possible. If you want to take over existing hosts, then you would need to do that in a different way. So let's type in our password and select the checkbox to overwrite the configuration. Now press Login.
[0:10:49]	And you will see that we have now set up a connection. The check mark on the right is for the connection of the configuration, the one on the left is the status connection used to get the states of all the hosts and services.
[0:11:07]	And to try it all out we are now going to add a new host to the monitoring system, which should be monitored in New York. We are currently still logged in to our main site in Munich, but we want to add a host that is actually created in New York, and that is very easy.
[0:11:28]	We simply go to Setup and then Hosts, and we create a host like we would normally do. The name will be checkmk_new_york, so this will be the monitoring server itself in New York. Now the important thing is this attribute here, 'Monitored on site', and here we pick on which site this host should be created. So in our case New York.
[0:12:03]	That means that this remote site will retrieve the data from the agent, and it's also possible that this remote site can access hosts in its network that the main site cannot access.
[0:12:16]	The only thing left to do now is type in the IP address, and press Save & go to service configuration.
[0:12:32]	And now it's actually the site in New York that is contacting the agent and performing the service discovery. This looks pretty much the same as it would if you added a host to a single-site setup, and this is on purpose because we want to make it feel like one large monitoring system.
[0:12:52]	So let's press 'Fix all' to monitor everything and activate the changes. And now here you also see a change: there is a second row for the second site that we added.
[0:13:17]	We added the host to the New York site. That's why you see this icon here, indicating that there are changes to be activated on this site, while our main site is still up to date, indicated by this green check mark here. Now let's press 'Activate on selected sites'.
[0:13:40]	Then only those sites which have pending changes will be restarted. And as you can see, both sites now have a green check mark indicating that everything is up to date. And when we now go to 'All hosts' you see that there is a new host here in New York.
[0:14:09]	And in the overview you now see that there are seven hosts: six in our main site and one in our New York site. And like that we have configured a distributed monitoring system which feels like one large system. It has the advantages of being scalable simply by adding new sites, it's failsafe, you can monitor into secure network areas without going through a firewall, and many more.
[0:14:35]	There is also a useful sidebar element, or snap-in, for when you use distributed monitoring. Let's go to the sidebar and click on the plus icon. If we now scroll down, there is a snap-in called 'Site status'. We can add it by clicking on it.
[0:14:59]	And besides showing the status of each of our sites, we can also use it to toggle the data of a site on and off. So let's go back to our host view. If we now toggle New York off, you'll see that the view is updated and all the data from this site is excluded.
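The effect of toggling a site in the snap-in can be pictured as a filter over the combined host list. The Python sketch below is a toy model under stated assumptions: the site ID "main" and the host names host1 to host6 are invented, and nothing here reflects how Checkmk implements its views internally.

```python
from collections import Counter

# Hypothetical host list: six hosts on the main site, one in New York.
HOSTS = [{"name": f"host{i}", "site": "main"} for i in range(1, 7)]
HOSTS.append({"name": "checkmk_new_york", "site": "checkmk_new_york"})


def visible_hosts(hosts, enabled_sites):
    """Keep only hosts whose site is currently toggled on."""
    return [h for h in hosts if h["site"] in enabled_sites]


def hosts_per_site(hosts):
    """Count hosts per site, similar to what an overview would show."""
    return Counter(h["site"] for h in hosts)


if __name__ == "__main__":
    print(hosts_per_site(HOSTS))
    # Toggle New York off: only the six hosts of the main site remain.
    only_main = visible_hosts(HOSTS, {"main"})
    print(len(only_main))  # 6
```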
[0:15:31]	It's important to know what happens when sites become unreachable. One of the important principles of distributed monitoring is that all the monitoring data remains on the remote sites and is not continuously transferred to the main site. This is important because it is one of the reasons that it's very scalable.
[0:15:46]	If you had, for example, 100 sites each with a million hosts, then writing all that data into a central database would be quite pointless. There is no database in the world that lets you write such a large amount of data in such a short period of time. So this not only ensures scalability, but it also makes sure that your remote site can keep working independently if the connection to the main site is ever lost.
[0:16:18]	On the other hand, if the remote site becomes unreachable, then you won't see any outdated data but rather no data at all. It won't be visible, and the status of the connection will say that it's dead and that the data is incomplete.
[0:16:29]	So when a remote site is unreachable it's quite pointless to see the status of one hour ago and that's why we don't show you any outdated data.
[0:16:43]	But rather we say that there is a problem with the connection and as soon as the connection becomes available again you will also receive the current status. So there is no backlog of monitoring updates but instead you will see up-to-date data of the remote site as soon as it's available.
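The behavior described here, live queries with no cached copy on the main site, can be modeled in a few lines of Python. This is a simplified sketch and not Checkmk's actual implementation: the function names, the site ID "main" and the fake fetch function are all invented for illustration.

```python
def collect_status(site_ids, fetch):
    """Query every site live and merge the answers.

    fetch(site_id) must return a list of host-status dicts or raise
    ConnectionError. Nothing is cached, so a dead site contributes no rows.
    """
    rows, dead_sites = [], []
    for site_id in site_ids:
        try:
            rows.extend(fetch(site_id))
        except ConnectionError:
            dead_sites.append(site_id)
    return rows, dead_sites


if __name__ == "__main__":
    def fake_fetch(site_id):
        # Stand-in for a real status query; New York is unreachable here.
        if site_id == "checkmk_new_york":
            raise ConnectionError("site unreachable")
        return [{"host": "host1", "site": site_id, "state": "UP"}]

    rows, dead = collect_status(["main", "checkmk_new_york"], fake_fetch)
    print(rows)
    print("incomplete data, dead sites:", dead)
```

Because every call asks the sites again, a site that comes back simply shows up with its current state on the next query, without any backlog to replay.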
[0:16:54]	So that was it for distributed monitoring and of course there are many other interesting aspects to it like what happens to alerts? What happens to the event console? And all of that you can find in our documentation.
[0:17:11]	Thanks for watching. I hope this was interesting to you. If so, subscribe to the channel, like the video, and I hope to see you in the next episode.