
Ep. 13: Working with Scheduled Downtimes

Note: All the videos on our website offered in the German language have English subtitles and transcripts, as given below.

[0:00:00] Welcome back to the 13th installment in our tutorial series.
[0:00:03] Our subject for today will be Scheduled Downtimes.
[0:00:18] Maintenance times, what is this all about? Well, our monitoring is there to detect problems, i.e. failures of hardware or software or, let's say, failures of hosts or services.
[0:00:27] Now, however, there are some outages that don't simply happen, but which you already know about in advance, specifically those related to maintenance.
[0:00:35] For example, if you do a firmware upgrade on a switch, this switch will simply be unavailable for a short time.
[0:00:41] You can enter these planned maintenance periods in the monitoring, and several things then take effect: the affected objects are marked with an icon in the monitoring, they disappear from the problem views and from the Tactical Overview, no notifications are sent during the maintenance period, and, if you have chosen to do so, a later availability analysis will treat these planned maintenance periods differently from unplanned outages.
[0:01:13] There is, however, a notification at the beginning and at the end of a maintenance period: a special notification is sent to everyone affected so that they are kept fully informed. Let's now look at how to set up maintenance times.
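The effects described so far (hiding problems and suppressing notifications during the maintenance, but announcing its start and end) can be modelled in a few lines. This is a toy sketch, not Checkmk code: `Downtime`, `problem_view` and `notifications` are hypothetical names, a host is reduced to a name and a state, and the notification labels borrow the Nagios-style `DOWNTIMESTART`/`DOWNTIMEEND` types.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Downtime:
    start: int  # Unix timestamp in seconds
    end: int    # Unix timestamp in seconds

    def active_at(self, t: int) -> bool:
        # Modelled as the half-open interval [start, end).
        return self.start <= t < self.end


def problem_view(states: Dict[str, str], downtimes: Dict[str, Downtime], t: int) -> List[str]:
    """Hosts that are DOWN at time t and not covered by an active downtime."""
    return sorted(
        host
        for host, state in states.items()
        if state == "DOWN" and not (host in downtimes and downtimes[host].active_at(t))
    )


def notifications(downtime: Downtime, problem_times: List[int]) -> List[Tuple[int, str]]:
    """Notifications for one host: a start and an end notification for the
    downtime itself, plus a PROBLEM notification for every failure that
    happens outside the downtime (failures inside it are suppressed)."""
    sent = [(downtime.start, "DOWNTIMESTART"), (downtime.end, "DOWNTIMEEND")]
    sent += [(t, "PROBLEM") for t in problem_times if not downtime.active_at(t)]
    return sorted(sent)


if __name__ == "__main__":
    dt = Downtime(start=1000, end=1000 + 2 * 3600)
    print(problem_view({"dbserver01": "DOWN", "web01": "DOWN"}, {"dbserver01": dt}, 2000))
    print(notifications(dt, [500, 2000, 9000]))
```

With a 2-hour downtime on dbserver01, only web01 shows up as a problem, and the failure at t=2000 produces no notification while the ones before and after the window do.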
[0:01:28] In principle this works very much like acknowledgements, so if you have seen the previous episode you will already know most of it. The only difference is that the command is not called 'Acknowledge', but 'Schedule Downtimes'.
[0:01:41] To set a maintenance time, first find the host or service you want to put into maintenance. It is useful to know that when a host is put into maintenance, all of its services are automatically considered to be in maintenance as well.
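The rule that a host's downtime also covers all of its services can be sketched as a small set computation. The function name and the `(host, service description)` tuples are illustrative assumptions, not how Checkmk stores downtimes internally.

```python
from typing import Dict, List, Set, Tuple

Service = Tuple[str, str]  # (host name, service description)


def effective_service_downtimes(
    services: Dict[str, List[str]],
    host_downtimes: Set[str],
    service_downtimes: Set[Service],
) -> Set[Service]:
    """Return every service that counts as 'in downtime': services with their
    own downtime, plus all services of a host that is itself in downtime."""
    result: Set[Service] = set(service_downtimes)
    for host, descriptions in services.items():
        if host in host_downtimes:
            result.update((host, d) for d in descriptions)
    return result


if __name__ == "__main__":
    services = {"dbserver01": ["CPU load", "Filesystem /"], "web01": ["HTTP"]}
    print(sorted(effective_service_downtimes(services, {"dbserver01"}, set())))
```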
[0:01:54] As an example I'll select one of the hosts; let's take 'dbserver01'. I need to be careful now: clicking it takes me to the list of its services, but I don't want to set a maintenance time on the services. In this case I want to set it on the server itself.
[0:02:08] So I click the host name to open the details for this host, and then I click the hammer icon again to execute a command.
[0:02:18] I collapse the first box, 'Acknowledge', and close the sidebar to have more space, and now you can see 'Schedule Downtimes' in the second box.
[0:02:28] The most important thing to enter now is again a comment, for example 'Need firmware upgrade', and then the second most important entry is the time when this maintenance period should be started.
[0:02:44] You can, for example, start now and run for 60 minutes, 2 hours, or all of today, this week, this month or even this year; alternatively, you can specify a time range in the future.
[0:02:55] For example, if the maintenance should start not now but in an hour, or should run from 14:00 to 16:00, you can of course enter those times here.
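The ways of specifying the time described above all reduce to a start and an end timestamp. Here is a minimal sketch with hypothetical helper names; it assumes that "all of today" means "from now until the next midnight" and uses naive local times without time-zone handling.

```python
from datetime import datetime, timedelta
from typing import Tuple

Range = Tuple[datetime, datetime]


def for_duration(now: datetime, minutes: int, start_in_minutes: int = 0) -> Range:
    """'Start now (or a bit later) and run for N minutes'."""
    start = now + timedelta(minutes=start_in_minutes)
    return start, start + timedelta(minutes=minutes)


def rest_of_today(now: datetime) -> Range:
    """'All of today', interpreted here as from now until the next midnight."""
    next_midnight = (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    return now, next_midnight


def clock_range(day: datetime, start_hm: str, end_hm: str) -> Range:
    """An explicit range on one day, e.g. '14:00' to '16:00'."""
    def at(hm: str) -> datetime:
        hour, minute = map(int, hm.split(":"))
        return day.replace(hour=hour, minute=minute, second=0, microsecond=0)

    start, end = at(start_hm), at(end_hm)
    if end <= start:
        raise ValueError("the end of a downtime must lie after its start")
    return start, end


if __name__ == "__main__":
    now = datetime(2021, 3, 10, 13, 20)
    print(for_duration(now, 120))                       # now, for 2 hours
    print(for_duration(now, 120, start_in_minutes=60))  # in an hour, for 2 hours
    print(rest_of_today(now))
    print(clock_range(now, "14:00", "16:00"))
```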
[0:03:10] Let's take a simple example in which the maintenance starts now and runs for the next 2 hours. I just have to press this button, confirm the prompt with 'Yes', and return to this view, and now you can see a small icon indicating that this host is in a scheduled maintenance period that has just started.
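The same fixed 2-hour host downtime can also be expressed outside the GUI. Checkmk's Livestatus interface accepts Nagios-style external commands of the form `COMMAND [timestamp] SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment`. The sketch below only builds that string. The helper name, the author `cmkadmin` and the timestamps are illustrative. Setting the duration to `end - start` for fixed downtimes is a choice made here. Sending the line to the site's Livestatus socket is not shown.

```python
def schedule_host_downtime(host: str, start: int, end: int, author: str,
                           comment: str, now: int, fixed: bool = True,
                           duration: int = 0) -> str:
    """Build a Livestatus COMMAND line that schedules a host downtime.

    Field order of the Nagios-style external command:
    SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
    fixed=True  -> the downtime runs exactly from start to end
    fixed=False -> flexible: `duration` is the maximum length in seconds
    """
    if ";" in host or ";" in author:
        raise ValueError("host and author must not contain ';'")
    if "\n" in comment:
        raise ValueError("the comment must be a single line")
    if end <= start:
        raise ValueError("end must be after start")
    if fixed:
        duration = end - start
    fields = ["SCHEDULE_HOST_DOWNTIME", host, str(start), str(end),
              "1" if fixed else "0", "0", str(duration), author, comment]
    return f"COMMAND [{now}] " + ";".join(fields) + "\n"


if __name__ == "__main__":
    start = 1615382400  # illustrative Unix timestamp
    print(schedule_host_downtime("dbserver01", start, start + 2 * 3600,
                                 "cmkadmin", "Need firmware upgrade", now=start), end="")
```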
[0:03:32] As mentioned before, notifications are now switched off, and if this host were DOWN (which, by the way, does not have to be the case), it would not be listed under 'Problems'.
[0:03:43] If you want to learn more about the downtime, just click this icon to get the list of scheduled downtimes for the host dbserver01. There is usually only one, and that is what is shown here.
[0:04:01] I can see the comment I entered, when the downtime was created, that it started 162 seconds ago, and that it will end in 117 minutes.
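The two relative values are consistent with each other. A 2-hour downtime lasts 7200 s. After 162 s, 7038 s remain, about 117.3 minutes, displayed as 117. The sketch below reproduces that arithmetic and also produces the absolute form that the timestamp toggle switches to. Rounding down to whole minutes is an assumption about the display, and the function name and timestamps are illustrative.

```python
from datetime import datetime, timezone
from typing import Tuple


def describe(start_ts: int, end_ts: int, now_ts: int) -> Tuple[str, str, str]:
    """Relative description plus absolute UTC timestamps for a downtime."""
    started_ago = now_ts - start_ts            # seconds since the start
    ends_in = (end_ts - now_ts) // 60          # whole minutes left, rounded down
    fmt = "%Y-%m-%d %H:%M:%S UTC"
    return (
        f"started {started_ago} s ago, ends in {ends_in} min",
        datetime.fromtimestamp(start_ts, tz=timezone.utc).strftime(fmt),
        datetime.fromtimestamp(end_ts, tz=timezone.utc).strftime(fmt),
    )


if __name__ == "__main__":
    start = 1615382400           # illustrative: 2021-03-10 13:20:00 UTC
    end = start + 2 * 3600       # the 2-hour downtime from the example
    print(describe(start, end, now_ts=start + 162))
```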
[0:04:11] By the way, as always you can change the date format with this icon: if the relative timestamps are confusing, you can switch to absolute timestamps, and they will then be displayed like this in the view.
[0:04:27] Whether the list of downtimes belongs to a host or to services, there are again commands you can run on it. Two of them are particularly interesting. The first is removing the downtime with the 'Remove' button.
[0:04:43] The second is adjusting the downtime. A very popular option here is extending it: if the maintenance time turns out not to be enough, you can, for example, add one more hour.
[0:04:56] The maintenance time is then simply extended by the additional hour. Let's now return to the form in which you set the properties for the downtime.
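Conceptually, the two commands on the downtime list behave as follows: extending moves the end time later, and removing deletes the downtime by its ID. The sketch models the extension as plain arithmetic on a hypothetical `HostDowntime` record; this says nothing about how Checkmk implements it internally. For removal it builds the Nagios-style `DEL_HOST_DOWNTIME;<downtime_id>` command that Livestatus accepts. The ID and timestamps are illustrative.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HostDowntime:
    downtime_id: int
    host: str
    start: int   # Unix timestamp
    end: int     # Unix timestamp
    comment: str


def extend(dt: HostDowntime, seconds: int = 3600) -> HostDowntime:
    """Return a copy of the downtime whose end is moved later by `seconds`."""
    if seconds <= 0:
        raise ValueError("an extension must be positive")
    return replace(dt, end=dt.end + seconds)


def remove_command(dt: HostDowntime, now: int) -> str:
    """Livestatus COMMAND line that deletes this downtime by its ID."""
    return f"COMMAND [{now}] DEL_HOST_DOWNTIME;{dt.downtime_id}\n"


if __name__ == "__main__":
    dt = HostDowntime(1, "dbserver01", 1615382400, 1615389600, "Need firmware upgrade")
    print(extend(dt).end - dt.end)                     # one extra hour, in seconds
    print(remove_command(dt, now=1615382562), end="")
```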
[0:05:06] There are still a number of interesting options here that I would like to show you. So I'll simply return to the host, go to this command again, and collapse the sidebar again.
[0:05:23] So far we have seen the comment and the time range, but there are some other very interesting options here.
[0:05:31] The option 'Flexible with max. duration' is a very interesting one. It is meant for cases where you don't know exactly when the maintenance will happen: you can say that the device will go into maintenance at some point within the next two hours, and the maintenance time will start exactly when the device actually goes into a DOWN state. That is what this option does: if I tick this box, the maintenance period does not start immediately, but only when the object actually changes to a problem state, such as DOWN for a host or WARNING or CRITICAL for a service, and that change must happen within the next two hours for the maintenance period to begin.
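The triggering logic of a flexible downtime can be sketched as a scan over state changes. The downtime begins at the first change to a problem state inside the window and then lasts at most the maximum duration. If no such change happens in the window, the downtime never starts. This is a simplified model with hypothetical names. It ignores details such as an object that is already in a problem state when the downtime is scheduled.

```python
from typing import List, Optional, Tuple

OK_STATES = {"UP", "OK"}  # every other state counts as a problem state


def flexible_downtime(window_start: int, window_end: int, max_duration: int,
                      state_changes: List[Tuple[int, str]]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the actual downtime, or None if it never starts.

    The downtime begins at the first change to a problem state that falls
    inside [window_start, window_end) and lasts max_duration seconds.
    """
    for t, state in sorted(state_changes):
        if window_start <= t < window_end and state not in OK_STATES:
            return t, t + max_duration
    return None


if __name__ == "__main__":
    # Two-hour window, one hour maximum duration, failure after 5000 s.
    print(flexible_downtime(0, 7200, 3600, [(100, "OK"), (5000, "CRIT")]))
```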
[0:06:18] Here there is another option which is interesting if you work with routers or switches.
[0:06:24] Suppose, for example, that you are doing maintenance on a router, and behind this router there are many other hosts that the monitoring can only reach via this router.
[0:06:33] In such a case, these hosts would also be reported as DOWN and notifications would be triggered for them. By ticking this option you specify that the maintenance time automatically applies to the entire network area that is reached via this host.
[0:06:48] This can also be done recursively, so that it extends across all levels and really covers every host that is reachable via the router, directly or indirectly.
[0:07:01] The third option is again a very special feature: you can also set maintenance times so that they are repeated regularly.
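In Checkmk, the topology behind a router comes from the parents configured for each host. The choice between "direct children only" and "recursively" is then a graph traversal. Here is a minimal breadth-first sketch. The host names and the dictionary representation are illustrative assumptions.

```python
from collections import deque
from typing import Dict, List, Set


def children_from_parents(parents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert a 'host -> its parents' map into 'host -> its children'."""
    children: Dict[str, List[str]] = {}
    for host, host_parents in parents.items():
        for parent in host_parents:
            children.setdefault(parent, []).append(host)
    return children


def hosts_behind(root: str, children: Dict[str, List[str]], recursive: bool = True) -> Set[str]:
    """Hosts that would be included in the downtime of `root`.

    Non-recursive: only the direct children. Recursive: breadth-first over all
    levels below `root`. The membership check on `affected` guards against cycles.
    """
    affected: Set[str] = set()
    queue = deque(children.get(root, []))
    while queue:
        host = queue.popleft()
        if host == root or host in affected:
            continue
        affected.add(host)
        if recursive:
            queue.extend(children.get(host, []))
    return affected


if __name__ == "__main__":
    parents = {"switch01": ["router01"], "switch02": ["router01"],
               "srv1": ["switch01"], "srv2": ["switch01"], "srv3": ["switch02"]}
    children = children_from_parents(parents)
    print(sorted(hosts_behind("router01", children, recursive=False)))
    print(sorted(hosts_behind("router01", children)))
```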
[0:07:11] There are two methods for this: one uses rules, and the other is direct. I'm going to show the direct method first, where you can say that this maintenance time should be repeated automatically, for example every week or every day.
[0:07:23] The times you set above are then simply repeated every day or every week, which lets you set up a regular maintenance time. This is very useful if, for example, all of your Windows servers are rebooted at a specific time on a specific day once a week.
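A daily or weekly repetition shifts the same start and end by a fixed step. Here is a minimal sketch with a hypothetical `occurrences` helper, using naive local datetimes. It deliberately ignores daylight-saving transitions, which a real scheduler has to handle.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

STEPS = {"day": timedelta(days=1), "week": timedelta(weeks=1)}


def occurrences(start: datetime, end: datetime, every: str, count: int) -> List[Tuple[datetime, datetime]]:
    """The first `count` repetitions of a downtime that recurs every day or week."""
    step = STEPS[every]
    return [(start + i * step, end + i * step) for i in range(count)]


if __name__ == "__main__":
    # A weekly one-hour reboot window, Sundays 03:00-04:00 (naive local time).
    for s, e in occurrences(datetime(2021, 3, 14, 3, 0), datetime(2021, 3, 14, 4, 0), "week", 3):
        print(s.strftime("%a %Y-%m-%d %H:%M"), "->", e.strftime("%H:%M"))
```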
[0:07:45] However, this is only possible with the Enterprise Edition, which is why it says "This only works when using CMC"; CMC is the monitoring core of the Enterprise Edition.
[0:07:55] However, if you have many hosts that need a regular maintenance time following the same pattern, it would be very cumbersome to select them all, run the command, and set these downtimes manually.
[0:08:06] In particular, hosts added later would not be covered. That is why there is a second method for doing this, and that method uses rules.
[0:08:15] All you need is the appropriate rule set, which, as always, you can find under 'Host & Service Parameters'.
[0:08:21] I simply search for the word 'downtime' and find 'Recurring downtimes for hosts' and 'Recurring downtimes for services'. Here I can create a rule that works on the same basic principle as all the other rules.
[0:08:36] We have already made a couple of videos on the subject of rules. Now you can again simply enter here when the downtime should start and end, and at which intervals it should be repeated.
[0:08:49] The advantage is that the rule does not only apply to the objects that already exist: if a new host is added for which this rule applies, it automatically gets the same regular maintenance time, so you are always on the safe side.
[0:09:04] Finally, I would like to show you how to view the history of maintenance times, i.e. when maintenance times started and when they ended.
[0:09:13] There is a separate view for this, which you can find under 'Views', in the 'Other' section, as 'History of scheduled downtimes'. Here you can see, for example, that a maintenance period for our host 'dbserver01' started 12 minutes ago. We didn't do more than that on this test system, but in practice this list will keep growing, and you will be able to see exactly when maintenance periods started and ended.
[0:09:46] So, that's it for today. Thanks for watching, and see you next time for the topic of distributed monitoring.
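The difference between the direct command and a rule is when the set of hosts is determined. The command captures the hosts selected at that moment. The rule is evaluated against whatever hosts exist later. Here is a toy sketch. `RecurringDowntimeRule` and its fields are hypothetical and do not reflect the real rule set's parameters. The host condition is modelled as a regular expression matched from the start of the host name.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class RecurringDowntimeRule:
    host_pattern: str   # regular expression, matched from the start of the host name
    weekday: int        # 0 = Monday ... 6 = Sunday
    start_hm: str       # e.g. "03:00"
    end_hm: str         # e.g. "04:00"

    def applies_to(self, host: str) -> bool:
        return re.match(self.host_pattern, host) is not None


def hosts_covered(rule: RecurringDowntimeRule, hosts: List[str]) -> List[str]:
    """Evaluate the rule against the hosts that exist at the time of the call."""
    return sorted(h for h in hosts if rule.applies_to(h))


if __name__ == "__main__":
    rule = RecurringDowntimeRule(r"win", weekday=6, start_hm="03:00", end_hm="04:00")
    hosts = ["winsrv01", "dbserver01"]
    set_by_command = hosts_covered(rule, hosts)   # a one-off selection
    hosts.append("winsrv02")                      # a new host is added later
    print(set_by_command)                         # still only the old host
    print(hosts_covered(rule, hosts))             # the rule picks up the new one
```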
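A history like this can be reconstructed by pairing start and end events per host. The sketch below assumes Nagios-style monitoring log lines such as `HOST DOWNTIME ALERT: <host>;STARTED; ...`. The exact log format the Checkmk view reads from is an assumption here. The sample lines are illustrative and do not come from the video.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed Nagios-style log format, e.g.
# [1615382400] HOST DOWNTIME ALERT: dbserver01;STARTED; Host has entered a period of scheduled downtime
ALERT = re.compile(r"^\[(\d+)\] HOST DOWNTIME ALERT: ([^;]+);(STARTED|STOPPED|CANCELLED);")


def downtime_history(lines: List[str]) -> List[Tuple[str, int, Optional[int]]]:
    """Pair each STARTED entry with the next STOPPED/CANCELLED entry of the same host.

    Returns (host, start, end) tuples; end is None for downtimes still running.
    """
    open_since: Dict[str, int] = {}
    history: List[Tuple[str, int, Optional[int]]] = []
    for line in lines:
        m = ALERT.match(line)
        if m is None:
            continue  # not a host downtime entry
        ts, host, event = int(m.group(1)), m.group(2), m.group(3)
        if event == "STARTED":
            open_since[host] = ts
        elif host in open_since:
            history.append((host, open_since.pop(host), ts))
    history.extend((host, start, None) for host, start in open_since.items())
    return history


if __name__ == "__main__":
    sample = [
        "[1615382400] HOST DOWNTIME ALERT: dbserver01;STARTED; Host has entered a period of scheduled downtime",
        "[1615389600] HOST DOWNTIME ALERT: dbserver01;STOPPED; Host has exited from a period of scheduled downtime",
    ]
    print(downtime_history(sample))
```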
