Ep. 17: Working with Network topologies in Checkmk
Note: All the videos on our website offered in the German language have English subtitles and transcripts, as given below.
[0:00:01] | In this episode, we look at the Parents and the network topology. Welcome to the final episode in our first season of Checkmk tutorials. |
[0:00:20] | The subject for today is Parents and network topologies. I'll show you in a moment what that is all about and what you can do with it. Please do not be concerned, there will of course be a second season in the future. We will be shooting that with Checkmk 2.0 and with a completely new user interface. |
[0:00:37] | Network topology – what can I actually do with that? Let's just have a look at it now, here in our test system. Here we have a small Checkmk system with eight hosts. |
[0:00:49] | This is also not very exciting – here are five servers, the Checkmk server is also there, two routers, and a Windows computer that is currently down. |
[0:00:58] | Now if you go to this ‘Network Topology’ item – you may have done this before – you'll see a funny cloud of hosts like this one here. I will just drag it to the center. What does that now tell me? |
[0:01:12] | Not much – I still just have my hosts, and this one here is down. I collapse the sidebar, so that I can see more. ‘Windows 01’ is DOWN, or for example, here this database server is UP. |
[0:01:24] | In itself, the graphic is otherwise relatively meaningless. How can I now fill it with life? |
[0:01:30] | Now how can I set up a topology here, and why should I do that anyway? |
[0:01:34] | In our example, not only do we have the servers, but I also have two routers here. Now let's assume that the database servers are accessible via ‘router01’. |
[0:01:46] | What would happen if this router failed? |
[0:01:50] | Of course, then the situation would be that the database servers will also no longer be accessible for monitoring. |
[0:01:56] | That’s why everything would go into a DOWN state, and you would get a whole series of alerts, not only from the router, but also from the database servers. |
[0:02:04] | What would actually be more useful is a single alarm from the router that is no longer reachable, together with the information that the database servers located downstream are merely inaccessible to the monitoring. |
[0:02:16] | For this, Checkmk has the status UNREACHABLE, meaning that the host cannot be contacted. Normally, no alarm is sent for such a host, since an alarm has already been triggered for the router. |
[0:02:26] | By the way, it is quite possible that the database servers still work wonderfully, and that these are even accessible for the users, but just not for the monitoring system. |
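The idea described here can be sketched in a few lines of Python. This is only an illustration of the principle, not Checkmk's actual implementation: the names `PARENT`, `host_state` and `hosts_to_alert` are made up for the example, and `db-02-server` is an assumed host name.

```python
# Illustrative sketch only, not Checkmk's code: one parent per host.
# The PARENT mapping and "db-02-server" are assumptions for this example.
PARENT = {
    "db-01-server": "router01",
    "db-02-server": "router01",
    "router01": None,  # reached directly from the monitoring server
}

def host_state(host, responds):
    """Return UP, DOWN or UNREACHABLE from the monitoring's point of view."""
    if responds[host]:
        return "UP"
    parent = PARENT[host]
    # If the parent is not UP either, the fault is probably on the path.
    if parent is not None and host_state(parent, responds) != "UP":
        return "UNREACHABLE"
    return "DOWN"

def hosts_to_alert(responds):
    """Only hosts that are really DOWN raise an alert in this sketch."""
    return [h for h in PARENT if host_state(h, responds) == "DOWN"]

# router01 fails, so nothing behind it answers the monitoring any more.
responds = {"db-01-server": False, "db-02-server": False, "router01": False}
alerts = hosts_to_alert(responds)
print(alerts)
```

With the parent relationship in place, only the router ends up in the alert list; without it, every silent host would be treated as DOWN and would alert separately.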
[0:02:33] | For this to work, however, you must tell the Checkmk server that it can access the database servers via ‘router01’. For this purpose, there is the ‘Parents’ field in the host configuration. |
[0:02:47] | To do this, I now go to the host administration, then I go to my servers and choose one of the database servers. |
[0:02:58] | For example, this one, and this I now edit. Next I go to the ‘Parents’ field, activate it, and enter ‘router01’. |
[0:03:11] | Of course I will Save, and activate the change. And now when I go back to the Network Topology, I can see that there is a small difference from before. |
[0:03:26] | Let’s wait until this settles down. Collapse the sidebar. |
[0:03:30] | I’ll move it back over to the right-hand side, then you'll see that the ‘db-01-server’ is now located behind the ‘router01’. You can see these moving lines here – they indicate the direction in which the monitoring data is flowing. |
[0:03:46] | Here, from the ‘db-01-server’ to the ‘router01’, and from there to the Checkmk instance. The Checkmk logo is representative of the entire system. |
[0:03:56] | Now Checkmk knows that to reach this server, it has to go through this router. |
[0:04:03] | If the router should fail, it will know that it cannot actually reach this ‘db-01-server’. Let’s do this for all of the database servers. To do this I go back to my Host administration, to my servers. |
[0:04:19] | I do the following now – I go back to this first database server, and deactivate ‘Parents’, because it is much more intelligent to specify this by using the folder. |
[0:04:30] | As you can see, I have all of the servers in a single folder. |
[0:04:33] | If these are now all accessible via a router, I can simply go to the properties of the folder here and enter ‘router01’ as the parent in the folder. |
[0:04:46] | Activate the changes. The five servers now hang below this router in the topology. |
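As a rough mental model of why the folder setting reaches every server, the following Python sketch resolves a host's parents by checking the host first and falling back to its folder. The data layout, the folder name `servers`, the host `db-02-server` and the function `effective_parents` are assumptions for illustration, not Checkmk's internal format.

```python
# Illustrative sketch only: a folder-level setting applies to every host
# inside the folder unless a host sets its own value.
# Folder name, data layout and function name are assumptions.
FOLDER_ATTRIBUTES = {"servers": {"parents": ["router01"]}}

HOSTS = {
    "db-01-server": {"folder": "servers", "attributes": {}},
    "db-02-server": {"folder": "servers", "attributes": {}},
}

def effective_parents(host_name):
    """Use the host's own 'parents' if set, otherwise the folder's."""
    host = HOSTS[host_name]
    if "parents" in host["attributes"]:
        return host["attributes"]["parents"]
    return FOLDER_ATTRIBUTES[host["folder"]].get("parents", [])

inherited = {name: effective_parents(name) for name in HOSTS}
print(inherited)
```

This is why the parent was removed from the individual host first: with no value of its own, the host simply picks up whatever is set on the folder.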
[0:04:54] | How can I try it out now to see if it works? |
[0:04:58] | I could simply switch off ‘router01’ at this point, but since this is a test system and the router doesn’t actually exist, I can’t do that. |
[0:05:07] | But here’s what I can do now – in the monitoring, I go here to this router. |
[0:05:12] | I click on it here, and can go to ‘Details of Host’ here with the right mouse button. |
[0:05:19] | Now I am at the service list for this host. I go to the host itself – to the commands – and now I do two things: One is that I deactivate the active checks. |
[0:05:32] | I always select ‘disable active checks’. This simply leaves it in the state it was in before. |
[0:05:40] | Then I manually set this state to DOWN. This means I am using the ‘Fake Check Results’ command here. |
[0:05:47] | I can also enter ‘TEST’ here and simply set this host manually to Down. |
[0:05:56] | Because I have deactivated the active checks, it also stays on DOWN, and does not go back to UP after a few seconds, since it is actually still accessible. |
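For readers who would rather script this test than click through it, the two steps roughly correspond to the Nagios-style external commands `DISABLE_HOST_CHECK` and `PROCESS_HOST_CHECK_RESULT`. The Python sketch below only builds the command lines in the `COMMAND [timestamp] ...` form used by Livestatus. Whether your Checkmk version and monitoring core accept them, and how they are sent, is an assumption to verify against the documentation. The helper name `fake_down_commands` and the fixed timestamp are made up for the example.

```python
import time

HOST_DOWN = 1  # Nagios-style host status codes: 0 = UP, 1 = DOWN

def fake_down_commands(host, output, now=None):
    """Build the two command lines; nothing is sent anywhere."""
    ts = int(time.time()) if now is None else int(now)
    return [
        # Step 1: stop active checks so the fake state is not overwritten.
        f"COMMAND [{ts}] DISABLE_HOST_CHECK;{host}",
        # Step 2: submit a fake check result that sets the host to DOWN.
        f"COMMAND [{ts}] PROCESS_HOST_CHECK_RESULT;{host};{HOST_DOWN};{output}",
    ]

commands = fake_down_commands("router01", "TEST", now=1600000000)
for line in commands:
    print(line)
```

The order matters for the same reason as in the GUI: if active checks stay enabled, the next real check would set the host back to UP.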
[0:06:06] | Now when I go again to the network topology, I can see the following: This ‘windows01’ host, which was previously on DOWN, is now suddenly marked as UNREACHABLE. |
[0:06:20] | This means that Checkmk now knows that the host is not actually responding, but that the reason is probably that a router located somewhere on the connection path is also not responding. |
[0:06:30] | Now you will probably be asking yourself – “Why haven’t the other four servers gone to unreachable as well?” |
[0:06:35] | In reality they would, of course, but in my test system these hosts are pingable and that is why they are UP. |
[0:06:44] | This means that if Checkmk finds that it can still reach the servers even though the router is DOWN, the servers will still be shown as UP. |
[0:06:53] | In any case, Checkmk still tries to connect to these systems. The situation that we have here, where the router is down and the hosts are green, cannot actually happen in practice if you have set it up correctly. |
[0:07:06] | What if you have a system in which there are redundant routers? |
[0:07:11] | One where a target can be reached via two paths? |
[0:07:14] | In that case, if one of the two routers is still working, the target device should also be accessible. You don’t have a tree structure, so to speak, instead you have a meshed network, and that can also be mapped in Checkmk. |
[0:07:26] | To do this, I go back to my host administration, again to the folder – the properties. If you had been watching carefully before, you would have seen that there are several input fields for Parents. |
[0:07:39] | I can also enter a second router here. Save, and activate the changes. If I now go to my network topology, it will look a bit more confusing. We close the sidebar. If you look closely, you can see that each of these computers can be reached via the alternatives A ‘router01’ or B ‘router02’. |
[0:08:10] | This means that it won't take long for this host to change from UNREACHABLE to DOWN, because Checkmk sees that one of the routers is still active, namely ‘router02’, so this host must be reachable. |
[0:08:26] | If the host does not answer, then it is not the fault of the router, but of the host itself. That means that if we wait for a short time, the whole thing will go to red. |
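The rule for redundant paths can be sketched in Python as follows: a host that does not answer only counts as UNREACHABLE if none of its parents is UP. Again, this is an illustration under assumptions and not Checkmk's implementation; the `PARENTS` mapping and the function name `host_state` are invented for the example.

```python
# Illustrative sketch only: several parents per host (meshed network).
# The PARENTS mapping and function name are assumptions for this example.
PARENTS = {
    "windows01": ["router01", "router02"],
    "router01": [],
    "router02": [],
}

def host_state(host, responds):
    """UNREACHABLE only if the host is silent and no parent is UP."""
    if responds[host]:
        return "UP"
    parents = PARENTS[host]
    if parents and all(host_state(p, responds) != "UP" for p in parents):
        return "UNREACHABLE"
    return "DOWN"

one_router_left = {"windows01": False, "router01": False, "router02": True}
both_routers_gone = {"windows01": False, "router01": False, "router02": False}

state_redundant = host_state("windows01", one_router_left)
state_isolated = host_state("windows01", both_routers_gone)
print(state_redundant, state_isolated)
```

As long as one router still works, a silent host is blamed on the host itself. Only when every path is gone does the sketch treat it as UNREACHABLE.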
[0:08:35] | So as you can see, Checkmk gives you a method for mapping the network’s topology, primarily to avoid unnecessary alerts. When we talk about topology here, keep in mind that it is always the monitoring system's point of view: the path by which Checkmk can reach a server, not your network topology in general. |
[0:08:59] | In distributed monitoring, of course, each Checkmk instance has its own topology, since monitoring takes place locally from the instance. It is not about mapping your global network structure. |
[0:09:13] | If you use the tool correctly, you will avoid unnecessary alerts. It is absolutely recommended, especially if you have sites with many hosts that can only be reached via a few routers. |
[0:09:24] | With that, I would like to thank you for joining me for this first series of Checkmk tutorials. I hope to see you again in the second series. |
[0:09:37] | And besides, I've also now run out of shirts.... |