1.1. What do SLAs do?
In Checkmk you can evaluate availability, and on that basis also configure a rudimentary SLA evaluation. However, the absolute availability over a given period is not particularly meaningful. To take an extreme example: an availability of 99.9 percent would allow 8.76 hours of downtime per year. Distributed evenly over the year, these 8.76 hours average about 0.024 hours (roughly 1.4 minutes) per day, which may be acceptable for many systems. A single continuous outage of that length, however, could cause a lot of damage.
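The arithmetic behind such a figure is easy to check. The following minimal Python sketch converts an availability target into the downtime it permits; the function name is chosen for illustration only and is not part of Checkmk.

```python
def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the downtime (in hours) that an availability target permits."""
    return period_hours * (100.0 - availability_percent) / 100.0


HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

yearly = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
print(f"per year: {yearly:.2f} h")
print(f"per day:  {yearly / 365 * 60:.2f} min")
```

The same function shows how quickly the budget grows with a lower target: at 99 percent it is ten times as large.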
From version 1.5.0 Checkmk has a separate function for Service Level Agreements. The SLA feature builds on the availability data and now allows a much more detailed evaluation. Two different requirements can be implemented:
- Percentage of time that a service state (OK, WARN, CRIT, UNKNOWN) is above or below a given value.
- Maximum number of 'failures', more precisely of occurrences of the state CRIT, WARN or UNKNOWN lasting a given duration.
You can combine multiple instances of one or both requirements, for example to ensure that during the reporting period a particular service is OK at least 90 percent of the time, and that it enters the CRIT state for two or more minutes on no more than two occasions.
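How the two requirement types combine can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Checkmk's implementation: the timeline format and all function names are hypothetical.

```python
from typing import List, Tuple

# One entry per phase of the service: (state, duration in seconds).
Timeline = List[Tuple[str, int]]


def state_percentage(timeline: Timeline, state: str) -> float:
    """Share of the total time (in percent) spent in the given state."""
    total = sum(duration for _, duration in timeline)
    in_state = sum(duration for s, duration in timeline if s == state)
    return 100.0 * in_state / total if total else 0.0


def count_outages(timeline: Timeline, state: str, min_duration: int) -> int:
    """Number of phases in the given state lasting at least min_duration seconds."""
    return sum(1 for s, duration in timeline if s == state and duration >= min_duration)


def sla_fulfilled(timeline: Timeline) -> bool:
    """At least 90 percent OK and at most two CRIT phases of two minutes or more."""
    return (
        state_percentage(timeline, "OK") >= 90.0
        and count_outages(timeline, "CRIT", 120) <= 2
    )


# Hypothetical one-hour timeline with two CRIT phases, only one of them >= 2 minutes.
timeline = [("OK", 3000), ("CRIT", 150), ("OK", 300), ("CRIT", 60), ("OK", 90)]
print(sla_fulfilled(timeline))
```

Both conditions must hold at the same time; a service that is OK for 95 percent of the period can still break the SLA through too many long CRIT phases.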
The results of these calculations can later be rendered in two variants in views:
- Service Specific: A service will display its associated SLA.
- Column-specific: A fixed SLA is displayed for each service.
For example, the overview here shows the evaluation of the last 15 days for one file system, and it is immediately apparent that obvious problems have existed for two days.
What do these evaluations give you? On the one hand, you can see whether completed SLAs were fulfilled and, for example, disclose this to your customers. On the other hand, you can identify an impending breach in advance: By default, the SLA indicator turns CRIT only when the requirement has been violated. It can, however, be configured to go to CRIT earlier, for example once 80 percent of the permitted CRIT time for the service has been used up, and to WARN before that point.
Ultimately, SLAs are mostly very detailed views, computed from the availability data. You will see them in two places: first, in tables, either for all hosts and services listed there or only for services that are specifically tied to individual SLAs; second, on a comprehensive detail page for each service-SLA combination. Because of this proximity to views, the SLA feature is also embedded in the views configuration.
The data basis for the SLA function is the availability data. The calculations for the SLA requirements are, of course, not applied to the entire raw data inventory; after all, SLA reports should cover specific time periods. So first you determine the period over which the SLA requirements should be met, the so-called 'SLA period'. For example: The service MyService should be OK at least 90 percent of the time within a one-month period. Within this SLA period, not all data from the month's 24/7 operation necessarily has to be used. The data can be restricted to the time periods defined in WATO, for example to working hours.
This then results in a requirement such as: Over one month (the SLA period), during working hours (the time period) from Monday to Friday, 10:00 to 20:00, the service MyService should be OK 90 percent of the time (the SLA requirement). The SLA period and the time period complement each other, but the time period does not have to be restricted: You can also use all of the monitoring data generated over the SLA period.
In summary: You need two periods to restrict the data basis for the calculation of an SLA requirement:
- SLA Period: The period (e.g., weekly) agreed in the SLA that forms the basis for the report.
- Time period: Active time periods from the WATO configuration (business days, for example).
For each SLA period an independent result is generated. How many of these individual results you see in a table can be configured in the view. For example, you could show the last five weeks, limited to business days, as five individual SLA periods directly next to the hosts and services.
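The interplay of the two periods can be illustrated with a small Python sketch. It counts how many hours of a weekly SLA period remain once the data is limited to the working hours from the example above; the function names and the example date are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def in_working_hours(moment: datetime) -> bool:
    """Time period: Monday to Friday, 10:00 to 20:00 (end exclusive)."""
    return moment.weekday() < 5 and 10 <= moment.hour < 20


def considered_hours(period_start: datetime, period_end: datetime) -> int:
    """Count the full hours of an SLA period that fall inside the time period."""
    hours = 0
    moment = period_start
    while moment < period_end:
        if in_working_hours(moment):
            hours += 1
        moment += timedelta(hours=1)
    return hours


# SLA period: one week, starting on Monday 2024-01-01 (an arbitrary example date).
week_start = datetime(2024, 1, 1)
week_end = week_start + timedelta(days=7)
print(considered_hours(week_start, week_end), "of", 7 * 24, "hours are evaluated")
```

A 90 percent requirement then refers only to the hours that remain after this restriction, not to the full week.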
As usual with Checkmk, between the data source (the SLA definitions) and the output (the view) there is a rule that assigns an SLA to specific services, although this rule is optional. For SLAs that apply to certain services, this usually results in a three-step process:
- Define the SLA via Views ➳ Edit ➳ SLAs.
- Bind the SLA to hosts/services via the rule WATO ➳ Host & Service Parameters ➳ Grouping ➳ Assign SLA definition to service (optional).
- Create or adapt views for the SLA as required.
Here's how to set up a simple SLA including a view: The file systems of the hosts MyHost1 and MyHost2 are required to be in the OK state for at least 90 percent of the time over a reporting period of one week (in this example, a maximum of 80 percent has been reached). In addition, they may enter the WARN state for two or more minutes on a maximum of five occasions.
2. Setting up SLAs
2.1. Creating an SLA
First, create the actual SLA. The configuration menu can be found via Views ➳ Edit ➳ SLAs.
Create a new SLA via the icon for adding an SLA. In the General Properties section, first enter a unique ID, here MySLA, and a title, such as Filesystems.
Under SLA-Settings set the SLA period to the desired period, such as Weekly. The following requirements are therefore always valid for the period of one week.
Before you set up the actual requirements, you can add further restrictions and options under Filtering and computation options. These are not needed for our simple example SLA:
|Scheduled Downtimes||Consideration of scheduled downtimes.|
|Status Classification||Consideration of flapping, downtimes and times outside the monitoring times.|
|Service Status Grouping||Reclassification of states.|
|Only show objects with outages||Show only objects with given outage rates.|
|Host Status Grouping||Consideration of the host state UNREACH as UNREACH, UP or DOWN.|
|Service Time||Consideration of the service period.|
|Notification Period||Consideration of notification periods.|
|Short Time Intervals||Ignore intervals shorter than a given duration, so that brief interruptions are disregarded (similar to the concept of soft states).|
|Phase Merging||Directly successive reporting periods of the same status should not be merged.|
|Query Time Limit||Limit on the query time, as a remedy for slow or unresponsive systems.|
|Limit processed data||Limit on the number of data rows to be processed; the default is 5,000.|
Next, specify the actual requirements in the SLA requirements menu. Provided time periods have been defined in WATO, they can also be used in SLAs, as already mentioned under general availability. Select the desired period under Active in timeperiod or, as in this example, Always to define the requirements for 24/7 operation.
Under Title, give a meaningful name, say 90 percent OK.
For the first requirement, leave Computation Type at its default, Service state percentage, and add a new criterion. A new section opens for the Monitoring state requirement: To require at least 90 percent availability, set this entry to 'OK, Minimum, 90'. If the value falls below this, the SLA is considered broken and assumes the status CRIT, as will be seen later on the results page.
Perhaps the SLA should not wait until it breaks before going to CRIT, but should go to WARN as soon as 50 percent of the buffer has been consumed, and to CRIT once only 10 percent of the buffer is left. Actually breaking the SLA then produces the broken message but no further status change (more on this in the section on the detail page below). For such a configuration, check the box at Levels for SLA monitoring: Here you can enter the residual values for the transitions to WARN and CRIT. This completes the first requirement.
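The idea of a buffer and its residual levels can be made concrete with a short Python sketch. It is a simplified model, not Checkmk's actual computation: it assumes that the buffer is the share of time that may be non-OK (100 minus the required minimum) and that the levels are percentages of that buffer. The function name and its parameters are hypothetical.

```python
def sla_state(ok_percent: float, minimum: float = 90.0,
              warn_remaining: float = 50.0, crit_remaining: float = 10.0) -> str:
    """Map the achieved OK percentage to a state, based on the remaining buffer.

    Assumption: the buffer is the share of time that may be non-OK
    (100 - minimum); the levels are given in percent of that buffer.
    """
    buffer_total = 100.0 - minimum
    if buffer_total <= 0:
        raise ValueError("a minimum of 100 percent leaves no buffer")
    remaining = 100.0 * (ok_percent - minimum) / buffer_total
    if remaining < 0:
        return "CRIT (broken)"
    if remaining <= crit_remaining:
        return "CRIT"
    if remaining <= warn_remaining:
        return "WARN"
    return "OK"


for value in (96.0, 94.0, 90.5, 89.0):
    print(value, sla_state(value))
```

With a 90 percent minimum, a service at 94 percent OK has already used 60 percent of its buffer and would show WARN in this model, long before the SLA itself breaks.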
Now add a second requirement and again set the time period. Under Title, specify a name, for example Maximum of 5x WARN, each of 2 minutes, and this time set the Computation Type to Maximum number of service outages. The actual requirement is then: Maximum 5 times WARN with duration 0 days 0 hours 2 mins 0 secs. According to the SLA, the service may now assume the specified status for two minutes or longer on a maximum of five occasions per SLA period without the SLA being flagged as broken. Instead of WARN, another status could of course be specified here. Again, via Levels for SLA monitoring you can define how many remaining incidents trigger a WARN or CRIT before the SLA actually breaks.
As mentioned earlier, you can add more of these requirements and assemble detailed SLAs. But there are still no services that 'react' to this SLA; in our example, a rule must make this connection. How to use the configuration created so far without such an SLA-service connection is described in the section Column-specific SLA displays below.
2.2. Linking an SLA to a service
The SLA is connected to a service via WATO ➳ Host & Service Parameters ➳ Grouping ➳ Assign SLA definition to service. Create a rule, enable the only rule-specific option, Assign SLA to Service, and then choose your SLA definition MySLA from the pop-up menu, where it is listed by its title Filesystems.
Next, under Conditions, set further filters for the desired services in the Services section. As always, you can work with regular expressions here and, as in this example, link the SLA definition to all local file systems via Filesystem.*. Optionally, you can restrict the rule further using the filters for folders, host tags and explicit hosts; in our example these are the hosts MyHost1 and MyHost2.
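The effect of such a service condition can be tried out with Python's re module. The sketch assumes that the pattern is matched against the beginning of the service description, which is how re.match() behaves; the service names are hypothetical examples.

```python
import re

# Hypothetical service descriptions as they might appear on a host.
services = ["Filesystem /", "Filesystem /var", "Check_MK", "CPU load"]

# Assumption: the pattern is matched against the beginning of the
# service description, which re.match() does as well.
pattern = re.compile(r"Filesystem.*")

matched = [name for name in services if pattern.match(name)]
print(matched)
```

Only the two file system services match, so only they would be bound to the SLA by a rule with this condition.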
Of course, at this point you could also omit any service filtering and simply bind the SLA to all services. Why a column-specific SLA view is the better way to do that is explained below.
2.3. Integrating an SLA in view
So you have now created the SLA definition MySLA and tied it to all services of the two hosts that start with Filesystem. Now create a new view for the SLAs. For the SLA example, a simple view of the two hosts with their file system services and SLAs is sufficient. For contrast, the view will also include the Checkmk services, to which no SLA is currently tied.
Create a new view with Views ➳ Edit ➳ New. In the first dialog, specify All services as the Datasource. In the following dialog, which asks whether to show information from a single host or service, just confirm without making a selection.
Under General Properties, enter an ID, here MySLAView_Demo, a title, such as My SLA Demo View, and finally a topic like MyTopicSLA if you later wish to have all of your SLA views under their own node in the views navigation. All other values can be left unchanged while testing.
Now navigate to the Columns section and first add the three general columns Services: Service state, Hosts: Hostname and Services: Service description as the basis for the view.
The column selector also contains two SLA-specific columns: Hosts/Services: SLA - Service specific and Hosts/Services: SLA - Column specific. The latter shows a fixed SLA definition for each service in the view, which is the better alternative to binding an SLA to all services as mentioned above. More on this later. At this point, add the Hosts/Services: SLA - Service specific column. All sorts of options are now available for the presentation of the SLA's results.
SLA timerange: Use this to set the time frame for which you want to see SLA results. For example, if your SLA definition uses the reporting period Monthly and you select Last Year here, you receive twelve individual results. This example instead uses the SLA periods option, which lets you set the number of displayed reporting periods directly: For five periods/results, set Starting from period number to 0 and Looking back to 4.
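Why the values 0 and 4 yield five results can be seen in the following Python sketch. It rests on assumptions that the text does not spell out: period 0 is taken to be the week containing the current day, and weeks are taken to start on Monday. The function name and the example date are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple


def weekly_periods(today: date, start_period: int, looking_back: int) -> List[Tuple[date, date]]:
    """Return (first day, last day) of each weekly period, newest first.

    Assumptions: period 0 is the week containing `today`, weeks start on
    Monday, and `looking_back` further periods are added before it.
    """
    current_monday = today - timedelta(days=today.weekday())
    periods = []
    for number in range(start_period, start_period + looking_back + 1):
        first = current_monday - timedelta(weeks=number)
        periods.append((first, first + timedelta(days=6)))
    return periods


for first, last in weekly_periods(date(2024, 1, 31), start_period=0, looking_back=4):
    print(first, "to", last)
```

The starting period plus four earlier periods gives five independent results, one per week.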
Layout options: By default, this option is set to Only Display SLA Name. To actually see the results of the SLAs, choose here Display SLA statistics. You can display up to three different elements:
- Display SLA subresults for each requirement separately: Displays each affected SLA with its name.
- Display a summary for each SLA period: Shows a graphic summary under the Aggregated result label.
- Display a summary over all SLA periods: Shows a textual, percentage summary of all SLAs under the Summary label.
For the current example, activate all three options.
Generic plugin display options: Here you define, for outage and percentage SLAs, whether summaries (texts) or individual results (icons) of the reporting periods should appear. To see both in action, in Service outage count display options select the aggregated info over all SLA periods entry, and leave the option for percentage SLAs on Show seperate result for each SLA period.
If you want to group the view by individual hosts, optionally under Grouping add the column Host: Hostname - which ensures a visual separation of the hosts.
Because the view should show only the hosts MyHost1 and MyHost2, in the last step, under Context/Search Filters, set a filter for the hostname under Host: MyHost1|MyHost2. For a slightly clearer example view you can also set a filter under Services, for example Filesystem.*|Check_MK.*. You then get the SLA-monitored file system services and, as an unmonitored counterpart, the Checkmk services, which makes the effect of the service-specific SLA display clearer.
As a result you will get a view with five status icons as individual results from the percentage SLA, and a summary in the form of 100 percent for the outage SLA. These appear only in the lines for the file system services, of course; the Checkmk lines remain empty.
3. Further views
3.1. Column-specific SLA displays
The service-specific view has a big disadvantage: You can create multiple rules that assign different SLAs to the same service, but only the SLA assigned by the first of these rules can be displayed. There is no way to display the SLA of a second matching rule in a second column.
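This first-match behavior can be modeled in a few lines of Python. The rule list, the second SLA ID and the function name are hypothetical and serve only to illustrate the principle.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rules in evaluation order: (service pattern, SLA ID).
rules: List[Tuple[str, str]] = [
    (r"Filesystem.*", "MySLA"),
    (r"Filesystem /var", "MyOtherSLA"),  # also matches, but is never displayed
]


def service_specific_sla(service: str) -> Optional[str]:
    """Return the SLA of the first rule whose pattern matches the service."""
    for pattern, sla_id in rules:
        if re.match(pattern, service):
            return sla_id
    return None


print(service_specific_sla("Filesystem /var"))
print(service_specific_sla("CPU load"))
```

Although both rules match the service Filesystem /var, only the SLA of the first rule is returned; a service without a matching rule has no service-specific SLA at all.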
What you can do, however, is show several columns, each with a different fixed SLA. Such column-specific views are useful, for example, if you need multiple SLAs that should apply to all services of some or all hosts. You could, for instance, define gold, silver and bronze SLAs, each in a separate column next to the services of a host. It is then clear at a glance which SLA definitions a server or service meets. In short: The column-specific view allows you to display more than a single SLA per service.
In the example completed above, the three steps mentioned at the beginning were executed: create the SLA, bind it to a service, integrate it into a view. For column-specific views you can simply leave out the second step. Create only the SLA and set up a view with the Hosts/Services: SLA - Column specific column. The SLA results will then be displayed in each line, independently of the respective service.
The following image shows the above SLA view for MyHost1 with an additional column that shows the results of a fixed SLA (a maximum of three outages of Checkmk services) for each service; the difference between the service-specific and column-specific displays is thus clearly visible. What should also become clear: An SLA designed specifically for Checkmk services of course makes only limited sense for the file system services. It is worth planning thoroughly before beginning the implementation!
One more small note: In the options for the service-specific display, under Generic plugin display options, we saw the settings for outage and percentage SLAs. In the options for the column-specific display you see these two as well, but only if the SLA actually includes outage and percentage criteria. Here the options are not generic: Because a fixed, static SLA definition is referenced, only the options that belong to this SLA are shown.
3.2. SLA-Detail page
Integrating the SLA information into tables provides a fast overview, but of course you can also consider the results in detail. A click on the cell with the SLA data takes you directly to the detail page of the SLA results from the affected service.
Here four different types of information can be found:
- raw data of the availability,
- a summary of all of the requirements of an SLA,
- individual results of all of the requirements of an SLA and
- the SLA specification.
General information: Here you can see the raw availability data, and thus the SLA calculations as an overview of the status of each period, and below it the aggregated results of the SLA’s requirements.
See Computation plug-in information for information on each individual requirement of the SLA. The timeline shows every single state; the Result row shows the results for each individual reporting period. A special feature here: If, as described in the example, you have set SLA levels and the SLA goes to CRIT even before breaking, this is displayed with orange bars instead of the usual red ones. The bars turn red only when the SLA breaks. When you move the mouse pointer over a result bar, a hover menu shows the individual events responsible for the status; in the following picture, for example, the status is WARN because only four of the five allowed failures are left. The SLA broken message will also appear in this menu.
Finally, under SLA specification you will find the configuration data of your SLA, which helps you to better evaluate and understand the results presented.
A small note about using the view: If you hover the mouse over the result bar of a period the corresponding period will be highlighted - for all individual requirements and also the summary under General information. By clicking you can select/deselect one or more periods. This works in the Result and Aggregated results lines. For example, in the screenshot above the current period is highlighted on the far right.
3.3. SLAs for BI-Aggregates
You've already read above about using the availability for BI aggregates. SLAs are also available for the aggregates (the top level) via a small detour: The status of a BI aggregation can be monitored as a perfectly normal service through the Check State of BI Aggregation rule set. This service then appears, for example, as Aggr MySLA in the host views and can in turn be associated with an SLA via the Assign SLA definition to service rule used above.
You will find the rule under WATO ➳ Host & Service Parameters ➳ Active Checks ➳ Check State of BI Aggregation. The rule is designed to also query BI aggregates on remote Checkmk servers. Therefore you need to specify the URL of the server and an automation user, and of course the desired BI aggregate in the Aggregation Name field: Here you enter the title of a top-level rule from your BI pack.
Caution, there is a risk of confusion here: In the BI configuration you create the actual aggregation, i.e. the logic, using rules, and it is the title of one of the top-level rules that is specified here as the "Aggregation".
4. Error handling
What do I check if my SLA does not work, or does not work as expected?
In practice SLAs are an interplay of many different configurations: the SLA itself, the view and service options, time periods, rules and of course availability data. If the SLA shows different results than expected, just go through the complete chain. In case of doubt it also helps to visualize the entire process with pen and paper – to see all of the information involved at a glance. The following points can be used as a small checklist:
- Time periods: WATO ➳ Timeperiods
- Planned maintenance times: WATO ➳ Monitoring Configuration ➳ Recurring Downtimes for Hosts/Services – only with the
- Service times: WATO ➳ Monitoring Configuration ➳ Service Period for hosts and ... for services respectively
- SLA Service Link: WATO ➳ Host & Service Parameters ➳ Grouping ➳ Assign SLA definition to service
- Service-Configuration: WATO ➳ Host & Service Parameters ➳ MyService
- BI-Configuration: WATO ➳ Business Intelligence ➳ MyBiPack ➳ MyTopLevelRule
- BI-Monitoring: WATO ➳ Host & Service Parameters ➳ Active Checks ➳ Check State of BI Aggregation
- SLA-Configuration: Views ➳ Edit ➳ SLAs ➳ MySLA
- Options for the View: Views ➳ MyView
After you have checked the configurations, you can verify the functioning of the SLA using manual (fake) status changes and maintenance times by applying commands to the objects in a view.
How do I find out why my SLA is not being displayed in a view?
In such a case, open the settings of the affected view and first check the obvious: Is there a column with an SLA at all? Contradictory filters are a more likely cause, however: If you have tied the SLA to a service using a rule, this service must of course not be excluded by the view options under Context/Search Filters.
Service-bound SLAs have one more source of error: As described above, a view can display only one rule-linked SLA per service, namely that of the first matching rule. After all, the view only receives the instruction to display in each line the SLA associated with the service, not the second or fifth connected SLA. If you have nevertheless created such additional rules, they are simply ignored. In such cases, you can switch to the column-specific display.
Why am I not notified when my SLA is about to be violated?
In its simplest form, the SLA status changes only once the requirements have been broken. To be notified in advance, you must configure the Levels for SLA monitoring.