BI - Business Intelligence

Checkmk Manual
Last updated: August 8 2019

1. Introduction

Checkmk Business Intelligence - that sounds a bit lofty for what is basically a simple thing. But this name describes the core of the BI module in Checkmk pretty well. It is all about deriving the overall state of business-critical applications from the many individual status values, and presenting them clearly.

Take as an example the service email, which is still indispensable for many companies. This service is based on the correct functioning of a variety of hardware and software components - from specific switches, to SMTP and IMAP services, and to infrastructure services such as LDAP and DNS.

The failure of an essential building block is not a problem if this has been designed to be redundant. Conversely, a problem may occur in a service that at first glance has nothing to do with email, but which can have much more serious effects. A simple look at a list of services in Checkmk is not always meaningful - at least not for everyone!

Checkmk BI allows you to derive a summary of the overall health of an application from the current state of individual hosts and services. You use BI rules to define in a tree-like structure how elements are interdependent. Each application is then overall OK, WARN or CRIT. The information about the condition and the dependencies can be accessed in different ways:

  • A display of the overall status of an application in the GUI.
  • Calculation of the [availability] of an application.
  • [notifications] in the event of a problem, or even a failure of an application.
  • Impact Analysis: A service is in a CRIT state - which applications are affected?
  • Planning maintenance times and ‘what if...’ analyses.

In addition there is the possibility of using the tree representation in BI for a ‘drill down’ view of the state of a host and all its services.

A peculiarity of Checkmk BI, unlike comparable tools in the monitoring environment, is that here Checkmk also works rule-based. This allows you to dynamically describe an indefinite number of similar applications with a generic set of rules. That immensely facilitates the work and helps to avoid mistakes - especially in very dynamic environments.

2. Configuration Part 1: The first unit

2.1. Terms

Before you begin the step-by-step practical application, you first need to know a few terms:

Each application formalised with BI is called an Aggregation, since a general state is aggregated from many individual states.

An aggregation is a tree of objects. These are called nodes. The bottom nodes - the leaves of the tree - are the hosts and services in your Checkmk instances. The remaining nodes are artificially-created BI objects.

Each node is created by a rule. This also applies to the root of the tree - the topmost node. These rules determine which nodes hang under another node, and how the state of that node is derived from the states of its subnodes.

The top node of an aggregation (the root of the tree) is also generated with a rule. In this way a rule can generate multiple aggregations.

2.2. An example

The easiest way to understand this is to use a concrete example. We have thought up the ‘Mystery Application’ for this article. Suppose that this is an important application in an unspecified company. Among other things, five servers and two network switches play an important role. So that you can better understand the example, we use simple names like srv-mys-1 or switch-1. The following diagram gives a rough description of the structure:

  • The two servers srv-mys-1 and srv-mys-2 form a redundant cluster on which the actual application runs.
  • srv-db is a database server that stores the data of the application.
  • switch-1 and switch-2 are two redundant routers connecting the server network to a higher network.
  • In addition there is a time server srv-ntp which ensures an exactly synchronized time.
  • In addition the server srv-spool works here and passes the results calculated by the Mystery Application into a spool directory.
  • From the spool directory the data is picked up by a mysterious parent service.

If you want to work through the following steps one by one, you can simply recreate the listed monitoring objects. For a test it is sufficient if you clone an existing host several times and name the clones accordingly. Later a few services come into play, for which the relevant hosts must also be recorded in the monitoring. Even there you can cheat again: with simple dummy local checks you will quickly get matching services to play with.

The hosts will then look something like this in the monitoring:

2.3. Your first BI rule

Start with something simple - with the simplest possible meaningful aggregation: an aggregation with only two nodes. You then want to summarise the states of the hosts switch-1 and switch-2. The aggregation should be called network and should be OK if both switches are available. In the case of a partial failure, it should go to WARN, and if both switches are off, CRIT.

Get started: configure BI through the Business Intelligence WATO module. The configuration of the rules and aggregations happens within configuration packages – the BI Packs. The packages are not only practical because you can better manage more complex configurations with them – you can also set permissions on a package and thereby allow certain contact groups, and thus users without admin rights, to edit parts of the configuration. But more on that later ...

The first time you call the BI module it looks like this:

There is already a package titled Default Pack. This contains a demo for an aggregation which summarizes the data of an individual host.

For this example it is best to create a new package (the New BI Pack button), which you name Mystery. As always in Checkmk, specify an internal ID (mystery) which cannot be changed later, and a descriptive title. The Public option is needed by other users if there are rules in this package they want to use for their own rules or aggregations. Because you probably want to do your experiments alone in peace, leave it disabled:

After the creation you will of course find two packages in the main list:

Before each entry there is an icon for editing the properties, and an icon to get to the actual content of the package, which is where you want to go now. Once there you create your first rule.

As always in Checkmk, this rule also needs to have a unique ID and a title. The title of the rule not only has documentation character, but will later also be visible as the name of the node this rule creates:

The next box is named Child Node Generation and is the most important. Here you specify which objects in this node should be summarized. This can either be other BI nodes – for which you would choose a different BI rule – or be monitoring objects, i.e. hosts or services.

For the first example select the second variant and set two objects as children – namely the two hosts switch-1 and switch-2. This is done with the Add child node generator button. Here you logically choose State of a host, and enter the name of the host:

In the third and final box, Aggregation Function, you specify how the monitoring status of the node should be calculated. The basis for this is always the list of states of the subnodes. Different logical links are possible.

Pre-selected is Worst - takes worst of all node states. That would mean that the node becomes CRIT as soon as any one of the sub-nodes is CRIT or DOWN. As mentioned above this should not be the case here. Choose instead Count the number of nodes in state OK to get the number of subnodes with status OK as a yardstick. Here for the thresholds the two numbers 2 and 1 are proposed. That's great because it is exactly what you need:

  • If both switches are UP (this is counted as OK), the node should therefore be OK.
  • If only one switch is UP, the state becomes WARN.
  • And when both switches are DOWN, the state becomes CRIT.
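
The threshold logic of this aggregation function can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Checkmk's implementation: the names `count_ok` and `network_state` and the plain string states are made up for this sketch, and host states other than UP and DOWN are not handled.

```python
# Illustrative sketch only; function names and data layout are assumptions.

# Host states are mapped to service-like states (UP counts as OK).
HOST_TO_STATE = {"UP": "OK", "DOWN": "CRIT"}


def count_ok(states, ok_threshold=2, warn_threshold=1):
    """Return OK, WARN or CRIT depending on how many child states are OK."""
    ok_count = sum(1 for state in states if state == "OK")
    if ok_count >= ok_threshold:
        return "OK"
    if ok_count >= warn_threshold:
        return "WARN"
    return "CRIT"


def network_state(switch_1, switch_2):
    """Aggregate the host states of the two switches with thresholds 2 and 1."""
    return count_ok([HOST_TO_STATE[switch_1], HOST_TO_STATE[switch_2]])
```

With the thresholds 2 and 1, `network_state("UP", "DOWN")` yields WARN, and only two DOWN switches yield CRIT.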

And this is how the mask will look:

With a click on Create you will have your first rule:

2.4. Your first Aggregation

Now it is important that you understand that a rule is not an aggregation. Checkmk can not yet know if this is everything or just part of a bigger tree! Real BI objects are only created and become visible in the status interface when you create an Aggregation. To do this, switch to the list of aggregations.

The New Aggregation button takes you to a mask to create a new aggregation. There is little to fill in here. Under Aggregation Groups you can enter any names of your choosing. These then appear in the status interface as groups, under which all of those aggregations which share this group name become visible. This is actually the same concept as hashtags or keywords.

However, it's important that you leave Rule to call set to Call a rule, and at Rule: select the rule you have just created (and before that the Rule package in which it is located).

If you now create the aggregation with Create, you are done! Your first aggregation should now appear in the status interface - assuming in fact that you also have at least one of the hosts switch-1 or switch-2!

3. BI in Operation Part 1: The Status View

3.1. Displaying all aggregates

If you have done everything correctly you can now see your first aggregate over the status interface. The easiest way to do this is via the Views element in the sidebar using the Business Intelligence ➳ All Aggregations entry:

3.2. Working with the tree

Take a closer look at the appearance of the BI tree. The following example shows your mini-unit in a situation where one of the two switches is DOWN, and the other UP. As desired, the aggregate enters the WARN state:

You also see that, in order to standardise hosts and services, a host that is DOWN is treated almost like a service that is CRIT – and UP accordingly becomes OK.

Use the black triangle to expand and collapse views of subtrees.

The leaves of the tree show the states of hosts and services. The host name - and for services also the service name - is clickable and takes you to the current status of the corresponding object. Furthermore, you can also see the last output from the check plug-in.

To the left of each aggregate you will find two icons. The first icon takes you to a page that displays just that single aggregate. This is naturally mainly useful if you have created more than one aggregate. It is for example well-suited as a bookmark. The second icon takes you to the calculation of the availability. More on this later.

3.3. Trying BI: what if?

To the left of the hostname you will find an interesting icon. This allows a ‘what if’ analysis. The idea behind this is simple: clicking on the icon switches the object to another state as a test - however only for the BI interface – not for real! Multiple clicks take you through OK, WARN, CRIT and UNKNOWN and then back to the starting point.

BI then constructs the complete tree based on the assumed status. The following figure shows the minimum aggregate under the assumption that alongside switch-1 which has actually failed, also that switch-2 would be DOWN:

The overall state of the aggregate thereby goes from WARN to CRIT. At the same time the state’s color is backed by a checkered pattern. This pattern indicates to you that the real state is actually different. This is not always the case, because some changes in a host or service are no longer relevant to the overall condition, for example if the one in question is already CRIT.

You can use this ‘what if’ analysis in several ways, for example:

  • To test if the BI aggregate reacts the way you want.
  • When planning to shut down a component for maintenance.

In the latter scenario, as a test you set the device to be serviced, or its services, to a failed state. If the total aggregate then remains OK, that must mean that the failure can currently be compensated for by redundancy.
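
The idea behind the ‘what if’ analysis can be sketched as follows. This is a simplified model, not Checkmk code: the function names are hypothetical, and the aggregate is assumed to use the count-OK function with the thresholds 2 and 1 from the switch example. The second return value corresponds to the checkered pattern, i.e. whether the displayed state differs from the real one.

```python
# Illustrative sketch only; names and data layout are assumptions.

def count_ok(states, ok_threshold=2, warn_threshold=1):
    """Return OK, WARN or CRIT depending on how many child states are OK."""
    ok_count = sum(1 for state in states if state == "OK")
    if ok_count >= ok_threshold:
        return "OK"
    if ok_count >= warn_threshold:
        return "WARN"
    return "CRIT"


def what_if(real_states, assumed_states=None):
    """Evaluate the aggregate with assumed states laid over the real ones.

    Returns (displayed_state, differs_from_real_state).
    """
    assumed_states = assumed_states or {}
    real = count_ok(list(real_states.values()))
    effective = {**real_states, **assumed_states}
    displayed = count_ok(list(effective.values()))
    return displayed, displayed != real


# switch-1 has really failed, switch-2 is fine.
real = {"switch-1": "CRIT", "switch-2": "OK"}
```

Calling `what_if(real, {"switch-2": "CRIT"})` moves the aggregate from WARN to CRIT and flags the difference, whereas assuming CRIT for the already failed switch-1 changes nothing.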

3.4. Testing BI using fake states

There is another way to test the BI aggregates: by directly changing the actual state of an object. This is especially practical in a test system.

For this purpose, there is a host/service command named Fake check results. It is by default only available for the Administrator role. This method has been used, for example, for the creation of the screenshots used in this article where switch-1 has been set to DOWN. That's where the telltale text Manually set to Down by cmkadmin comes from.

Here's a helpful little hint: If you work with this method, it's best to disable the active checks for the relevant hosts and services, otherwise at the next check interval they will immediately go back to their actual state. If you are lazy just do it globally via the Master Control sidebar element. Just please NEVER forget to turn it back on later!

3.5. BI-groups

When creating an aggregate we briefly addressed the possibilities of the Aggregation Groups input. In the example you simply confirmed the suggested Main here. You are of course completely free in the allocation of names, and you can also assign an aggregate to multiple groups.

Groups become important when the number of aggregates exceeds what you want to see on a screen. You get to a group by clicking on one of the displayed group names on the All Aggregates page - in our example above that is simply on the Main heading. Of course, if so far you only have this single aggregate not much will change. However if you look closely, you will realize:

  • The title of the page is now called Aggregation group Main.
  • The group heading Main has disappeared.

If you want to visit this view more often, simply bookmark it - preferably with the Bookmarks element in the sidebar.

3.6. From host/service to aggregate

Once you have set up BI aggregates, you will find a new icon in the context menu of your hosts and services:

This icon takes you to the list of all aggregations in which the affected host or service is included.

4. Configuration Part 2: Multi-level trees

Following this first brief impression of the BI status interface, we return to configuration – because of course you cannot really impress anybody with such a mini aggregate.

It starts with you extending the tree by one level - that is, from two levels (root and leaves) to three levels (root, intermediate level, leaves). To do this combine your existing node ‘Switches 1 & 2’ with the NTP time synchronization state to a top node ‘Infrastructure’.

But in order - first of all the result in advance:

The prerequisite is that there is a host srv-ntp which has a service named NTP Time:

First create a BI rule which as subnode 1 receives the rule ‘Switches 1 & 2’, and as subnode 2 receives directly the service NTP Time of the host srv-ntp. At the top of the rule, select infrastructure as the rule ID and Infrastructure as the name. You need to enter no more information at this point:

In the Child node generation it gets interesting. The first entry is now of type Call a rule, and as a rule choose your rule from the above – so you actually ‘hang’ these rules virtually in the subtree.

The second subnode is of the type State of a service, and here choose your NTP Time service (please observe the exact spelling here, including upper and lower case characters):

This time you leave the Aggregation Function in the third box as Worst - take worst state of all nodes.

In this function the state of the node is thus derived from the worst status of a node below it. In this case: if a subnode goes to CRIT the node also goes to CRIT.
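
How the two aggregation functions combine in the three-level tree can be sketched like this. It is only a model for illustration: the function names are hypothetical, and the severity order OK < WARN < UNKNOWN < CRIT is an assumption of this sketch.

```python
# Illustrative sketch only; names and the severity order are assumptions.

SEVERITY = {"OK": 0, "WARN": 1, "UNKNOWN": 2, "CRIT": 3}


def worst(states):
    """Return the most severe of the given states."""
    return max(states, key=SEVERITY.__getitem__)


def count_ok(states, ok_threshold=2, warn_threshold=1):
    """Return OK, WARN or CRIT depending on how many child states are OK."""
    ok_count = sum(1 for state in states if state == "OK")
    if ok_count >= ok_threshold:
        return "OK"
    if ok_count >= warn_threshold:
        return "WARN"
    return "CRIT"


def infrastructure_state(switch_states, ntp_state):
    """Root node: worst of the 'Switches 1 & 2' node and the NTP Time service."""
    switches_node = count_ok(switch_states)
    return worst([switches_node, ntp_state])
```

A single failed switch therefore only produces WARN at the root, while a CRIT on NTP Time is passed straight through to the top.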

Of course to make the new, bigger tree visible, you'll once again need to create an aggregation. It's best to just change the existing aggregation so that from now on the new rule is used:

In this way you stick to a single aggregation, which then looks like this (this time both switches are back on OK):

5. BI in operation Part 2: Alternative Representations

Now that you have a slightly more interesting tree you can take a closer look at the various presentation options that Checkmk offers. The starting point for these is the so-called Display options, which you can access via the icon at the top of each status view. This opens a box with various options. The content of this box is always adapted to the elements shown on the page. In the case of BI you can currently find four options:

Instantly expand or collapse trees

If you display not just a single aggregate, but many, then the Initial expansion of aggregations setting is helpful. Here you define how far the trees should be unfolded when first displayed. The selection ranges from closed (collapsed) over the first three levels, to completely open (complete).

Only show problems

If you enable the Show only problems option, only such branches that do not have the status OK in the trees will be displayed. This will then look like this:

Types of tree representation

Under the item Type of tree layout you will find several alternative display types for the tree. One of these is called Table: top down and looks like this:

Extremely space-saving - especially if you want to see many units at the same time - is the representation Boxes. Here each node is a colored box which can be expanded with a click. The tree structure is no longer visible, but you can quickly click through to a problem with minimal space consumption. Here in the example the boxes are unfolded completely:

6. Configuration Part 3: Variables, Templates, Searches

6.1. Configuration with more intelligence

Continue with the configuration. Now it's time to really get down to business. So far the example has been so simple that it was possible to individually list all of the objects in the aggregation without difficulty. But what if things get more complex? What if you want to formulate many recurring identical or similar dependencies? What if an application includes not a single, but multiple instances? What if you want to merge hundreds of a database’s individual services into one BI node?

Well, for such requirements you need more powerful methods of configuration. And these are exactly what distinguishes Checkmk BI from other tools - and unfortunately the learning curve is a bit steeper. It is also the reason why Checkmk BI does not allow itself to be configured by ‘drag and drop’. Once you get to know the possibilities however, you will certainly not want to go without them.

6.2. Parameters

Let’s start with the parameters. Take the following situation: you not only want to know if the two switches are UP, but also want to know the state of the two ports that are responsible for the uplink. In overall terms, it concerns the following four services:

Now the node Switch 1 & 2 should be extended so that, instead of the two host states, it has one subnode per switch, each showing the host status and the two uplink interfaces. These two subnodes should be called Switch 1 and Switch 2.

Actually you now need two new rules - one for each switch. It is better to create a single new rule and equip it with a parameter. This parameter is a variable whose value you supply when you call the rule from the parent node – which here is the old rule Switch 1 & 2. In this example you can simply pass either a 1 or a 2. The parameter gets a name which you can choose freely. Take here for example the name NUMBER. The spelling with capital letters is purely arbitrary, and if you find lowercase letters more beautiful you are also free to use these.

And the rule’s heading will look like this:

You can choose switch as the ID for the new rule. At Parameters simply enter the name of the variable: NUMBER. Also important now is that the variable is used in the rule’s Rule Title so that both nodes are not just called switch and thus have the same name. When using the variable a leading and trailing dollar sign is set (as usual in many places in Checkmk). As a result the two nodes will then be called Switch 1 and Switch 2.
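
The substitution of a rule parameter can be pictured as a simple text replacement. The following sketch is only an illustration of that idea; the helper `expand` is hypothetical and says nothing about how Checkmk implements it internally.

```python
# Illustrative sketch only; `expand` is a hypothetical helper.

def expand(template, params):
    """Replace every $NAME$ placeholder with the value passed for NAME."""
    result = template
    for name, value in params.items():
        result = result.replace(f"${name}$", str(value))
    return result


# The same rule, called once per switch with a different argument.
titles = [expand("Switch $NUMBER$", {"NUMBER": n}) for n in (1, 2)]
hosts = [expand("switch-$NUMBER$", {"NUMBER": n}) for n in (1, 2)]
```

Here `titles` contains the two node names and `hosts` the two host names the rule refers to.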

Prefix match is the default for service name

For the Child node generator, the first thing to do is to insert the host state. Instead of the 1 or 2 in the hostname you may simply use your variable, again each with a leading and trailing $.

The same thing happens with the hostnames of the uplink interfaces. And here comes the second trick – because as you might have noticed from the small service list seen above, the services for the uplink are differently named at each switch! But that is no problem, because BI always treats the service name - completely analogous to the well-known service rules - as a prefix match which interprets regular expressions. So by simply writing Interface Uplink, you catch all of the services on the respective host which start with Interface Uplink:

By the way: By appending $ you can disable the prefix behavior. In regular expressions a $ means ‘The text must end here’. So Interface 1$ matches only with Interface 1, and not also, for example, with Interface 10!
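
The difference between a prefix match and a pattern terminated with $ can be tried out with Python's `re` module, whose `re.match` also anchors the pattern at the start of the string. This is only a sketch of the matching behaviour: the helper `matching_services` and the service names in the example are made up, and details of Checkmk's own regular expression handling may differ.

```python
# Illustrative sketch only; helper and service names are assumptions.
import re


def matching_services(pattern, services):
    """Return all service names that the pattern matches as a prefix."""
    regex = re.compile(pattern)
    # re.match anchors at the start of the string but not at the end,
    # which corresponds to a prefix match.
    return [name for name in services if regex.match(name)]


services = ["Interface 1", "Interface 10", "Interface Uplink A", "Interface Uplink B"]

uplinks = matching_services("Interface Uplink", services)
prefix = matching_services("Interface 1", services)
exact = matching_services("Interface 1$", services)
```

Without the trailing $ the pattern Interface 1 also catches Interface 10; with it, only Interface 1 remains.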

Now modify the old Switch 1 & 2 rule so that instead of the host states this new rule is only ever invoked once for each of the two switches. And here is also where the values 1 and 2 are provided as the parameters for the variable NUMBER:

And voila – you now have a pretty tree with three levels:

6.3. Regular expressions, missing objects

The subject of regular expressions is again worth a closer look. When matching the service name we have so far tacitly glossed over the fact that these are basically always regular expressions. As just mentioned, a prefix match applies.

So if, for example, you specify Disk as the service name in a BI node, all of the services of the host in question that begin with Disk will be captured.

The following principles generally apply:

  1. If a node refers to objects that do not (currently) exist, they are simply omitted.
  2. If a node becomes empty, it will be omitted.
  3. If the root node of an aggregate is also empty, the aggregate itself will be omitted.
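
These three principles amount to pruning the tree from the leaves upwards. The following sketch models this with a deliberately simple data structure (a leaf is a string, an inner node is a tuple of title and children); both the structure and the function `prune` are assumptions made for this illustration.

```python
# Illustrative sketch only; tree layout and `prune` are assumptions.

def prune(node, existing):
    """Return the tree without missing leaves and empty nodes, or None."""
    if isinstance(node, str):
        # Principle 1: objects that do not exist are simply omitted.
        return node if node in existing else None
    title, children = node
    pruned = (prune(child, existing) for child in children)
    kept = [child for child in pruned if child is not None]
    # Principles 2 and 3: an empty node (including the root) is omitted.
    return (title, kept) if kept else None


tree = ("Infrastructure", [("Switches 1 & 2", ["switch-1", "switch-2"]), "srv-ntp"])
```

If only switch-1 is monitored, `prune(tree, {"switch-1"})` keeps a reduced tree; if none of the objects exist, the result is `None` and the aggregate would not appear at all.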

Maybe that sounds a bit bold to you! Isn't it dangerous to just silently omit things that should be there if they are missing?

Well - over time you will notice how practical this concept is, because this will allow you to write ‘smart’ rules that can react to very different situations. Is there a service that does not exist with every instance of an application? No problem - it is only considered if it is there! Or can hosts or services be temporarily removed from monitoring? These then simply disappear from BI without leading to errors or the like. BI is not there to see if your monitoring configuration is complete!

Incidentally – this principle also applies to explicitly defined services, since strictly speaking there is no such thing: service names are always treated as regular expressions, even if they do not contain special characters such as .*. A service name is always automatically a search pattern.

6.4. Creating a node as the result of a search

But you can still automate further and, above all, react flexibly to changes. Continue with the example of the two application servers srv-mys-1 and srv-mys-2 from the example. Your tree should continue to grow. The Infrastructure node should slip to level 2. And as a definitive root, there should be a rule with the title The Mystery Application under which everything will hang. Alongside Infrastructure there should be a node named Mystery Servers. Under this the (currently) two mystery-servers are supposed to hang. In each a few generic services come into the aggregate. The result should look like this:

Bottom Rule: Mystery Server X

Start from the bottom, because that is always the easiest way in BI. Below is the new Mystery Server X rule. Of course you have a single parameter so that you do not need a separate rule for each server. You can again name the parameter NUMBER, for example. It should then later have the value 1 or 2. As already done above you will again have to enter NUMBER in the header at Parameters.

The resulting child-node generator looks like this:

What follows is remarkable:

  • The hostname srv-mys-$NUMBER$ will use the number from the parameter.
  • With Service: the sophisticated regular expression CPU|Memory is used which uses a vertical bar to allow alternative service names (prefixes), and matches all services that begin with CPU or Memory. This saves a doubling of the configuration!
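
Put together, the child node generator of this rule behaves roughly like the sketch below. The function `mystery_server_children` and the service names in the example are hypothetical; Python's `re.match` is used here to imitate the prefix match.

```python
# Illustrative sketch only; function and service names are assumptions.
import re


def mystery_server_children(number, services_by_host):
    """Return (host, service) pairs selected by the rule for one server."""
    host = "srv-mys-$NUMBER$".replace("$NUMBER$", str(number))
    # The vertical bar allows alternative prefixes: CPU or Memory.
    pattern = re.compile("CPU|Memory")
    return [
        (host, service)
        for service in services_by_host.get(host, [])
        if pattern.match(service)
    ]


services_by_host = {
    "srv-mys-1": ["CPU load", "CPU utilization", "Filesystem /", "Memory"],
    "srv-mys-2": ["CPU load", "Memory"],
}
```

A host that is not in the monitoring simply yields an empty list, in line with the principle that missing objects are omitted.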

Incidentally, this example is of course not necessarily perfect. For example the status of the host itself has not been recorded at all. So if one of the servers goes DOWN, the services on it will become obsolete (go stale), but their state will remain OK, and the aggregate does not ‘notice’ the failure. If you want to be informed about something like that, you should in any case record the host status as well as the services!

Middle Rule: Mystery Servers

This rule is interesting. It summarises the two mystery servers together into a node. Now it should be possible that the number of servers is not fixed, and later there can sometimes be three or more, or it could be that there are dozens of instances of the mystery application - each with a different number of servers!

The trick is in the child node generator type Create nodes based on host search. This searches for existing hosts and creates nodes based on the hosts found. It looks like this:

The whole thing works like this:

  1. You formulate a search condition to find hosts.
  2. A child node is created for each host found.
  3. You can cut parts out of the found hostnames and provide these as parameters.

Finding is the beginning. As usual there are host tags available. In the example you can omit this and instead use the regular expression srv-mys-(.*) for the host name. This matches all host names starting with srv-mys-. The .* stands for any string.

It is important that the .* is bracketed, thus (.*). By using the parentheses the match forms a so-called group. With this the text which exactly matches .* is captured - here 1 or 2. The match groups are numbered internally. Here there is only one that receives the number 1. You can then later access the matched text with $1$.

The search will now find two hosts:

Hostname     Value for $1$
srv-mys-1    1
srv-mys-2    2

For each host found you will now create a subnode with the Call a rule function. Select the rule Mystery Server $NUMBER$ which you just created. As the argument for NUMBER now pass the match group: $1$.

Now the sub-rule Mystery Server $NUMBER$ is called twice: once with 1 and once with 2.

If in the future a new server with the name srv-mys-3 is added to the monitoring, it will automatically appear in the BI aggregate! The state of the host does not matter. Even if the server is DOWN, it will of course not be removed from the aggregate!
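
The host search with a match group can be sketched in Python as follows. The functions `host_search` and `mystery_servers` are hypothetical helpers for this illustration, and the use of `re.match` (anchored at the start of the host name) is an assumption of the sketch.

```python
# Illustrative sketch only; helper names are assumptions.
import re


def host_search(pattern, hosts):
    """Return (hostname, text of match group 1) for every matching host."""
    regex = re.compile(pattern)
    found = []
    for host in hosts:
        match = regex.match(host)
        if match:
            found.append((host, match.group(1)))
    return found


def mystery_servers(hosts):
    """Call the sub-rule once per host found, passing $1$ as NUMBER."""
    return [
        "Mystery Server $NUMBER$".replace("$NUMBER$", value)
        for _host, value in host_search(r"srv-mys-(.*)", hosts)
    ]


hosts = ["srv-db", "srv-mys-1", "srv-mys-2", "switch-1"]
```

Adding srv-mys-3 to `hosts` produces a third child node without any change to the rule.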

Granted, it is a very steep learning curve here. This method is really complex. But once you've tried it and understood it, you will understand just how powerful the whole concept is. And so far we have only scratched the surface of the possibilities!

The top-level rule

The new top-level node The Mystery Application is now simple: all that is needed is a new rule which has two child nodes of the Call a rule type. These two rules are the existing Infrastructure rule, and the newly-created Mystery Servers rule.

6.5. Creating a node with service search

Similar to the host search, there is also a child generator type Create nodes based on service search. Here's an example:

You can use () here – bracketing partial expressions – both at the host and at the service, where:

  • If you choose Regex for host name you must define exactly one parenthesis expression. The match text is then provided as $1$.
  • If you choose All hosts, the complete host name will be provided as $1$.
  • You can use several subgroups in the service name. The associated match texts are provided as $2$, $3$ and so on.
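
The numbering of the match groups in a service search can be illustrated like this. The sketch covers only the Regex for host name case, and the function `service_search`, the patterns and the service names are all made up for the example.

```python
# Illustrative sketch only; function, patterns and names are assumptions.
import re


def service_search(host_pattern, service_pattern, services_by_host):
    """Return one tuple per matching service: ($1$, $2$, $3$, ...).

    $1$ comes from the single group in the host pattern, $2$ onwards
    from the groups in the service pattern.
    """
    host_re = re.compile(host_pattern)
    service_re = re.compile(service_pattern)
    results = []
    for host, services in services_by_host.items():
        host_match = host_re.match(host)
        if not host_match:
            continue
        for service in services:
            service_match = service_re.match(service)
            if service_match:
                results.append((host_match.group(1),) + service_match.groups())
    return results


services_by_host = {
    "srv-db": ["CPU load", "Filesystem /", "Filesystem /var"],
    "switch-1": ["Interface 1"],
}
found = service_search(r"srv-(.*)", r"Filesystem (.*)", services_by_host)
```

Each tuple in `found` holds the values that would be available as $1$ and $2$ when calling a rule for that service.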

And please never forget that you can always call up the online help.

6.6. All other services

In your attempts you may have stumbled over the child generator State of remaining services. This generates a node for any of your host’s services that have not yet been sorted into your BI aggregate. This is useful if you use BI to combine the states of all of a host’s services into clearly-arranged groups - as is done in the included example.
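
Conceptually this generator collects whatever is left over once all other nodes have picked their services. A minimal sketch, with a hypothetical function name and made-up service names:

```python
# Illustrative sketch only; function and service names are assumptions.

def remaining_services(all_services, already_used):
    """Return the services of a host not yet used elsewhere in the aggregate."""
    used = set(already_used)
    return [service for service in all_services if service not in used]


all_services = ["CPU load", "Memory", "Filesystem /", "NTP Time"]
others = remaining_services(all_services, ["CPU load", "Memory"])
```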

7. The predefined host aggregation

As just mentioned you can also use BI to provide the services of a host in a structured way. You combine all services of a host into one tree in an aggregate, and basically use the worst function. The overall status of a host will then only be displayed if there is a problem with the host – you use BI as a clear ‘drill down’ method.

For this purpose Checkmk already provides a predefined set of rules which you just need to unlock. These rules are optimized for rendering services on Windows or Linux hosts, but of course you can customize them to your liking. You can find all of the rules in the rule package Default. As usual, access the rules by clicking the corresponding icon:

There you will find a list of twelve rules (abbreviated here):

The first rule is the rule for the root of the tree. The symbol for this rule takes you to a tree view. Here you can see how the rules are nested among each other:

Back in the list of rules, with the Aggregations button you can access the list of aggregations in this rule package - which consists of only one aggregation. In the Details simply uncheck the checkbox at Currently disable this aggregation and you will immediately get, for each host, an aggregation with a title such as Host myhost123. The result will then look like this for example:

8. Permissions and visibility

8.1. Permissions for editing

Again, back to the rule packages. For all editing actions in BI you usually need to have the Administrator role. More precisely, for BI there are two permissions:

By default only the first of the two permissions is active in the User role. Normal users can therefore only work in rule packages for which they have been defined as a contact. This is done in the Details of the rule package. In the following example the contact group The Mystery Admins has been authorised under Permitted Contact Groups – thus all members of this group can now edit the rules in this package:

By the way, with Public ➳ Allow all users to refer to this pack you can allow other users to at least use the rules contained here - i.e. to (elsewhere) define their own rules – which can then invoke these rules as subnodes.

8.2. Permissions on Hosts and Services

How is it with the actual visibility of the aggregations in the Status Interface? Which contacts are allowed to see something?

Well - in the BI aggregates themselves you cannot assign any rights. This is performed indirectly through the visibility of the host and services, and it is governed by the See all hosts and services option under WATO ➳ Roles & Permissions:

In the User role, this right is by default disabled. Normal users can see only shared hosts and services, and in BI this means that they can see exactly those BI aggregations which contain at least one shared host or service. Such aggregates however contain only these authorised objects, and they may therefore be somewhat ‘thinned out’. And this in turn means that they can have different statuses for different users!
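
Why a ‘thinned out’ aggregate can show different states to different users becomes clear in a small model. The sketch assumes a flat aggregate with the worst function; the function names, the severity order and the example states are assumptions for illustration only.

```python
# Illustrative sketch only; names, severity order and data are assumptions.

SEVERITY = {"OK": 0, "WARN": 1, "UNKNOWN": 2, "CRIT": 3}


def worst(states):
    """Return the most severe of the given states."""
    return max(states, key=SEVERITY.__getitem__)


def state_for_user(leaf_states, visible_objects):
    """Evaluate the aggregate using only the objects the user may see.

    Returns None if the user sees none of the objects, in which case the
    aggregate is not shown to that user at all.
    """
    states = [
        state for name, state in leaf_states.items() if name in visible_objects
    ]
    return worst(states) if states else None


leaf_states = {"srv-db": "CRIT", "srv-spool": "OK"}
```

A user who is only a contact for srv-spool sees the aggregate as OK, while a user who sees both hosts gets CRIT.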

Whether that is good or bad depends on what you want. If in doubt you can toggle the permission, and through a detour via BI allow some or all users to see hosts and services for which they are not contacts - and thus ensure that the status of an aggregate is always the same for everyone.

Of course this whole issue only matters if there are in fact aggregates that are so colorfully thrown together that some users are contacts for only parts of them.
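The ‘thinning out’ described above can be illustrated with a minimal Python sketch. It is not Checkmk's implementation: the function `state_for_user`, the flat ‘worst’ aggregation and the user names are assumptions made purely for illustration.

```python
OK, WARN, CRIT = 0, 1, 2

def state_for_user(objects, contacts, user, see_all=False):
    """Return the state of a 'worst'-type aggregate as one user sees it.

    objects:  dict mapping object name -> current state
    contacts: dict mapping object name -> set of users who are contacts
    see_all:  models the 'See all hosts and services' permission
    Returns None if the user sees no object, i.e. the aggregate is hidden.
    """
    visible = [
        state for name, state in objects.items()
        if see_all or user in contacts.get(name, set())
    ]
    return max(visible) if visible else None

# Hypothetical aggregate with two objects and two contacts
objects = {"switch-1": CRIT, "srv-mys-1 Memory": OK}
contacts = {"switch-1": {"alice"}, "srv-mys-1 Memory": {"alice", "bob"}}

print(state_for_user(objects, contacts, "alice"))  # 2 (CRIT)
print(state_for_user(objects, contacts, "bob"))    # 0 (OK), thinned out
```

With `see_all=True` both users would get the same state, which corresponds to enabling the permission for everyone.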

9. BI in Operation Part 3: Maintenance times, acknowledgments

9.1. The General Idea

How does BI actually manage maintenance times? Well, we have thought long and hard about the matter, and discussed it with many users – the result is as follows:

  • You cannot put a BI aggregate itself directly into a maintenance time - but you do not have to, because ...
  • The maintenance time for a BI aggregate is automatically derived from the maintenance times of its hosts and services.

To understand the rule by which BI calculates the ‘in maintenance’ status, it helps to recall the real idea behind maintenance times: The object in question is currently being worked on. Failures can be expected. Even if the object is currently OK, you should not rely on it. It can become CRIT at any time. This is known and documented, and it should therefore not trigger an alarm.

This idea can be transferred 1:1 into BI: In the aggregate, there may be a few hosts and services that are currently in maintenance. Whether these are currently OK or CRIT does not play a role, because it is a coincidence whether or not the objects go off and on again during the maintenance work. Just because there is an object in maintenance in the aggregate does not immediately mean that the application which the aggregate maps is itself ‘threatened’ and must also be marked as ‘in maintenance’. There may be built-in redundancy which compensates for the failure of the objects in maintenance. Only if such a failure would actually lead to a CRIT state for the aggregate - so there is not enough redundancy and the aggregate really is threatened - does Checkmk mark it as ‘in maintenance’. Here too, the current state of the objects does not matter.

To put it more concisely, the exact rule is as follows:

A BI aggregate is considered to be ‘in maintenance’ if under the assumption that all of the hosts and services of the aggregate that are currently in maintenance are CRIT, and the remainder are OK, the aggregate becomes CRIT.

Important: the actual current status plays no role in the calculation!

And here we have another example. To save space, this is a variant with only one mystery server instead of two:

First, the host switch-1 is under maintenance. For the Infrastructure node this has no effect, because switch-2 is not in maintenance, and thus Infrastructure is also not in maintenance. There is therefore no icon for derived maintenance times.

But: the service Memory on srv-mys-1 is also under maintenance. This one is not redundant. The maintenance is therefore inherited by the parent node Mystery Server 1, then continues up to Mystery Servers and finally to the top node The Mystery Application. So this top node is also in maintenance.
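The rule and the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Checkmk's code: the classes `Leaf` and `Node`, the helper functions, the second service name and the thresholds in `count_ok` (2 OK = OK, 1 OK = WARN) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

OK, WARN, CRIT = 0, 1, 2

def worst(states):
    return max(states)

def count_ok(states):
    # Assumed thresholds: 2 or more OK -> OK, exactly 1 OK -> WARN, else CRIT
    n_ok = states.count(OK)
    return OK if n_ok >= 2 else WARN if n_ok == 1 else CRIT

@dataclass
class Leaf:
    name: str
    in_downtime: bool = False

@dataclass
class Node:
    name: str
    func: Callable[[List[int]], int]
    children: list = field(default_factory=list)

def assumed_state(node):
    """State if every object in maintenance were CRIT and all others OK."""
    if isinstance(node, Leaf):
        return CRIT if node.in_downtime else OK
    return node.func([assumed_state(child) for child in node.children])

def in_maintenance(node):
    # The real current states are deliberately not consulted at all
    return assumed_state(node) == CRIT

infrastructure = Node("Infrastructure", count_ok,
                      [Leaf("switch-1", in_downtime=True), Leaf("switch-2")])
server = Node("Mystery Server 1", worst,
              [Leaf("srv-mys-1 Memory", in_downtime=True),
               Leaf("srv-mys-1 CPU load")])
servers = Node("Mystery Servers", worst, [server])
app = Node("The Mystery Application", worst, [infrastructure, servers])

print(in_maintenance(infrastructure))  # False: switch-2 provides redundancy
print(in_maintenance(app))             # True: inherited from the Memory service
```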

9.2. The Maintenance Time Command

We wrote above that you cannot manually put a BI aggregate into maintenance time? That's only half true, since in fact you can find a command for setting maintenance times in BI aggregates! But this does nothing more than to record a maintenance entry for each host and service in the aggregate! This of course usually leads to the aggregate itself being flagged as in maintenance. But that is only indirect.

9.3. Tuning Options

Above you have seen that the maintenance time calculation is based on an assumed CRIT state. In the properties of an aggregate you can customise the algorithm so that a node is already marked as in maintenance when its assumed state is WARN. The option for this is called Escalate downtime based on aggregated WARN state:

The basic assumption remains that objects under maintenance are CRIT. There is only a difference where, due to the aggregate function, a CRIT can become a WARN - as was the case in our very first example with Count the number of nodes in state OK. Here a maintenance time would already have been assumed if only one of the two switches was in maintenance.
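The effect of this option can be sketched as follows. The function names, the `escalate_on_warn` parameter and the thresholds in `count_ok` are assumptions for illustration and are not taken from Checkmk itself.

```python
OK, WARN, CRIT = 0, 1, 2

def count_ok(states, ok_needed=2, warn_needed=1):
    """Hypothetical 'count the number of nodes in state OK' function."""
    n_ok = sum(1 for state in states if state == OK)
    if n_ok >= ok_needed:
        return OK
    if n_ok >= warn_needed:
        return WARN
    return CRIT

def in_maintenance(downtime_flags, escalate_on_warn=False):
    """downtime_flags: one boolean per child, True if it is in maintenance."""
    # Objects in maintenance are still assumed to be CRIT, all others OK
    assumed = [CRIT if flag else OK for flag in downtime_flags]
    # Only the state from which the node counts as 'in maintenance' changes
    threshold = WARN if escalate_on_warn else CRIT
    return count_ok(assumed) >= threshold

# Two switches, only the first one is in maintenance
print(in_maintenance([True, False]))                         # False
print(in_maintenance([True, False], escalate_on_warn=True))  # True
```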

9.4. Acknowledgments

Quite similar to the maintenance times, BI also automatically calculates whether a problem has been acknowledged. This time, however, the state of the objects certainly plays a role.

The idea here is to transfer the following concept to BI: An object has a problem (WARN, CRIT), but it is known and someone is working on it.

You can calculate this for an aggregate as follows:

  • Suppose that all hosts and services that have acknowledged problems are OK again.
  • Would the aggregate itself then be OK again? Exactly then it is also considered acknowledged.

However, if the aggregate were to remain WARN or CRIT, it would not be considered acknowledged, because then there must be at least one important problem that has not been acknowledged and that keeps the aggregate from being OK.

By the way, the GUI will offer you a command for the BI aggregate to acknowledge its problems, but this only means that all hosts and services contained in the aggregate will be acknowledged (only those that currently have problems).
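The two-step calculation for acknowledgments can be sketched like this. It is a simplified model with a single flat aggregate: `is_acknowledged` is a hypothetical name, and returning False for an aggregate that is already OK is an assumption of this sketch.

```python
OK, WARN, CRIT = 0, 1, 2

def is_acknowledged(objects, aggregate=max):
    """objects: list of (state, acknowledged) tuples, one per host or service.

    aggregate is the aggregation function; max corresponds to 'worst'.
    """
    real_state = aggregate([state for state, _ in objects])
    if real_state == OK:
        return False  # assumption: nothing to acknowledge without a problem
    # Step 1: suppose all acknowledged problems are OK again
    assumed = [OK if acked else state for state, acked in objects]
    # Step 2: the aggregate counts as acknowledged only if it would then be OK
    return aggregate(assumed) == OK

print(is_acknowledged([(CRIT, True), (OK, False)]))    # True
print(is_acknowledged([(CRIT, True), (WARN, False)]))  # False
```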

10. Availability

Exactly as with hosts and services, you can also calculate the availability of one or more BI aggregates for any period of time in the past. To do this, the BI module reconstructs the state of the aggregate for each point in time in the past, based on the history of the aggregate’s hosts and services. Thus you can also calculate availability for periods in which the aggregate was not yet configured!
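The idea of reconstructing an aggregate's state from the history of its objects can be sketched as follows. This is a strongly simplified model and not the BI module's algorithm: the function `ok_fraction`, the event format and the use of a flat ‘worst’ aggregation are assumptions.

```python
OK, WARN, CRIT = 0, 1, 2

def ok_fraction(initial, events, start, end):
    """Fraction of [start, end) in which a 'worst' aggregate was OK.

    initial: dict object name -> state assumed before the first event
    events:  list of (timestamp, object name, new state) tuples
    """
    current = dict(initial)
    ok_time = 0
    last = start
    for ts, name, state in sorted(events):
        if ts <= start:
            current[name] = state  # only sets the starting state
            continue
        if ts >= end:
            break
        # Account for the interval that ends with this state change
        if max(current.values()) == OK:
            ok_time += ts - last
        current[name] = state
        last = ts
    # Account for the remaining interval up to the end of the period
    if max(current.values()) == OK:
        ok_time += end - last
    return ok_time / (end - start)

history = [(20, "smtp", CRIT), (50, "smtp", OK), (80, "imap", WARN)]
print(ok_fraction({"smtp": OK, "imap": OK}, history, 0, 100))  # 0.5
```

Because only the object histories are needed, such a calculation also works for periods before the aggregate was defined.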

11. BI in Distributed Monitoring

What is actually happening in BI in a distributed environment? That is, when the hosts are spread across multiple monitoring servers?

The answer is relatively simple: it works - without you needing to pay attention to anything. Because BI is a component of the GUI, and the GUI supports distributed environments as standard, the distribution is completely transparent to BI.

Should a location be currently unavailable, or manually hidden by you in the GUI, the hosts of that location no longer exist for BI. That then means:

  • BI aggregates which are constructed exclusively from objects at this location disappear.
  • BI aggregates that are constructed partially from objects at this location are thinned out.

In the latter case, of course, this can affect the status of the affected aggregates. The exact effects depend on your aggregation functions. If you, for example, have used worst everywhere, the overall status can only stay the same or get better, because objects at the no-longer existing location could already have had WARN or CRIT. Of course other states can also arise for other aggregation functions.

Whether or not this behavior is practical for your operation will have to be assessed for individual cases. BI is in any case constructed so that nonexistent objects cannot be included in an aggregate, and thus cannot be missed, because all BI rules work – as already explained above – exclusively with search patterns.
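How an unavailable location affects different aggregation functions can be illustrated with a small sketch. The site names, the function `aggregate_state` and the thresholds in `count_ok` are assumptions for illustration only.

```python
OK, WARN, CRIT = 0, 1, 2

def worst(states):
    return max(states)

def count_ok(states):
    # Assumed thresholds: 2 or more OK -> OK, exactly 1 OK -> WARN, else CRIT
    n_ok = states.count(OK)
    return OK if n_ok >= 2 else WARN if n_ok == 1 else CRIT

def aggregate_state(objects, func, offline_sites=()):
    """objects: list of (site, state); objects of offline sites vanish.

    Returns None if no object is left, i.e. the aggregate disappears.
    """
    remaining = [state for site, state in objects if site not in offline_sites]
    return func(remaining) if remaining else None

servers = [("site_a", CRIT), ("site_b", OK)]
switches = [("site_a", OK), ("site_b", OK)]

# With worst, the state stays the same or improves: CRIT becomes OK here
print(aggregate_state(servers, worst, offline_sites={"site_a"}))      # 0
# With a counting function it can get worse: OK becomes WARN here
print(aggregate_state(switches, count_ok, offline_sites={"site_b"}))  # 1
```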

12. Notifications, BI as a Service

12.1. Active Checks or Data Source Programs

Can you actually send notifications for status changes in BI aggregates? Well - that's not directly possible at first, since BI exists exclusively in the GUI and has no relation to the actual monitoring. But you can turn BI aggregates into normal services, and these can in turn of course trigger notifications. There are two possibilities:

  • Using the data source program Check state of BI Aggregations
  • With Active Checks of the Check State of BI Aggregation type

12.2. Notifications via a data source program

We will start with the ‘data source program’ method, because this is always good if you wish to generate more than a handful of aggregates as services. You will find the appropriate rule set under Datasource Programs ➳ Check state of BI Aggregations:

Here you can even choose between different options for the hosts to which the services should be added. You do not necessarily have to stick to the host which is running the data source program (Assign to the querying host). It is also possible to assign to the hosts which are affected by the aggregate (Assign to the affected hosts). That however only makes sense if the aggregate concerns only a single host. Regular expressions and substitutions can make you even more flexible with assignments. The whole thing is then performed via the piggyback mechanism.

Important: If the host to which you assign this rule is also still queried through the normal agent, ensure in its settings that both the agent and the data source programs are run:

12.3. Notifications via an active check

Notification with an active check is the more direct way, and it requires no artificial ‘helper host’ for executing a data source program. However, since it has to query each aggregate individually, it is significantly less efficient with larger numbers of aggregates and also more complicated to set up.

Putting it all simply: There is an active check which can retrieve the state of BI aggregates using HTTP from the Web API of Checkmk. You can easily set this up with the Host & Service Parameters ➳ Active checks ➳ Check State of BI Aggregation ruleset:

Please note the following:

  • Enable this rule only for the host that should receive the corresponding new BI service.
  • The URL must be the one that allows this host to access the Checkmk GUI.
  • The user must be an automation user – only such users may call the Web API. The automation user is a natural choice here, since it is always created automatically for such purposes.
  • As the password, enter the user’s Automation secret for machine accounts, which you will find in the configuration mask of the user properties.

In the example Automatically track downtimes of aggregation is activated. Strictly speaking, this refers to scheduled downtimes, i.e. planned maintenance times. With this option the new active service automatically gets a maintenance time whenever the BI aggregate has one!

The new service then shows the state of the aggregate - with a delay of up to one check interval of course. The example shows the BI check on the host srv-mys-1:

As usual you can assign this service to contacts and use it as a basis for notifications.

13. Performance

13.1. Single Host Aggregations

Finally, a few words about performance, because performance is always important. BI has many years of hard practical use behind it, and you would not believe what our dear users have contrived with it! A lot of time has already been put into optimising performance, so that BI responds quickly and consumes little CPU time.

Especially if you work with host aggregations, you can quickly end up with a few thousand aggregates. So that BI still stays fast in these situations, it is important that you mark aggregates which you know only affect one host.

To do this, in the aggregation’s details activate the Optimization ➳ the aggregation covers only one host and its parents checkbox. It will then be much easier for BI to find the right services.

13.2. Internal process

If you reach a limit where calculation times are slowly becoming noticeable, you will notice this especially in the time shortly after an Activate Changes. BI is designed to calculate the trees in two steps:

  1. The structure of the aggregates is calculated (we call it compiling).
  2. The status of the aggregates is calculated.

The first step is always necessary when the number of hosts or services has been changed, and this is only known through performing an Activate Changes. For aggregations marked as Single Host Aggregations, the compilation step is delayed until the host in question is called. This is an important part of the optimisation.

The status of aggregates will of course always be recalculated when you display an aggregate.
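The two-step process, including the delayed compilation, can be sketched with a small cache. This is only a conceptual model: the class `BICache`, its methods and the flat ‘worst’ aggregation are hypothetical and say nothing about how Checkmk implements this internally.

```python
OK, WARN, CRIT = 0, 1, 2

class BICache:
    """Compile aggregate structures lazily, compute states on every display."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # maps an aggregate name to its objects
        self._trees = {}
        self.compile_count = 0

    def invalidate(self):
        # Models an Activate Changes: hosts or services may have changed
        self._trees.clear()

    def tree(self, name):
        # Step 1 (compiling) only runs when the structure is first needed
        if name not in self._trees:
            self._trees[name] = self._compile_fn(name)
            self.compile_count += 1
        return self._trees[name]

    def state(self, name, current_states):
        # Step 2 (status) is recalculated every time, here with 'worst'
        return max(current_states[obj] for obj in self.tree(name))

cache = BICache(lambda name: [name + " CPU load", name + " Memory"])
states = {"srv-mys-1 CPU load": OK, "srv-mys-1 Memory": WARN}

print(cache.state("srv-mys-1", states))  # 1 (WARN), compiles once
print(cache.state("srv-mys-1", states))  # 1 (WARN), reuses the structure
print(cache.compile_count)               # 1
```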