A beginner's guide to Checkmk

Checkmk Manual

Last updated: November 19 2018

Dear readers,

The smooth operation of IT systems has always been a challenge. Both the complexity of the hardware and software stacks and the demands of users continue to increase – regardless of whether you work with real hardware or with cloud solutions. These days a detailed and comprehensive IT monitoring solution plays a central role in an efficient organisation.

The requirements that users expect from their monitoring are of course as complex as the IT world itself. From its very beginning Checkmk has been developed for large and heterogeneous IT landscapes. That is why it offers a wealth of features and capabilities in order to meet all of the challenges to be found in an organisation. For entry-level users the comprehensiveness of Checkmk can at first be overwhelming.

So that you can nevertheless get your first Checkmk monitoring system into operation quickly and easily, we have broken the Checkmk User's Guide into two parts:

  1. A beginner's guide – this article
  2. A comprehensive reference section

The Beginner's Guide takes you step by step through Checkmk, and it is structured so that you can read it quickly from start to finish, and can then begin working with Checkmk. That is why the guide is also short and concise, and contains no distracting, unnecessary details. At the end of the guide you will have a working Checkmk system. In the last section, some of our experienced consultants will show you a few very useful tips and tricks which have proven themselves in almost every Checkmk installation.

Of course, our beginner's manual leaves many questions unanswered. Answers to these can be found in the manual's reference section. There for each topic you will find all the background and details to gain deeper knowledge.

1. Implementing Checkmk

1.1. Selecting a Checkmk Edition

Before you can install Checkmk, you first have to consider the question of which Checkmk you want. There are four different editions:

The Checkmk Raw Edition is free and 100% open source, and contains Nagios as its core. It can comprehensively monitor complex environments. You can receive support through the community, on our mailing lists, and in the future also in a community portal.

The Checkmk Enterprise Edition is aimed primarily at professional users. Beyond the scope of the Raw Edition it offers a number of interesting features, such as a very high-performance core that replaces Nagios, a reporting function, a sophisticated system for the visualization of measured values, a flexible agent deployment function, and much more. For the Enterprise Edition you can get professional support from us and from our partners.

You can find a list of its most important differences compared to the Raw Edition on our homepage.

The Free Edition is the right solution for you if you want to test the Enterprise Edition first without obligation, or if you want to install Checkmk in small operations with up to two sites with 10 monitored hosts each. The Free Edition contains all of the features of the Enterprise Edition and is supplied at no cost. Both the Free Edition and the Raw Edition can be upgraded directly and easily to the Enterprise Edition at a later date.

The Checkmk Enterprise Managed Services Edition is the right edition for you if you are a managed service provider offering services to your customers. It is a multi-client-capable extension of the Enterprise Edition.

1.2. Choosing a version

We are of course continuously developing all Checkmk editions, so there are different versions of each edition. For getting started we recommend the latest stable version of Checkmk. A detailed overview of the other types of versions that exist can be found in its own article.

1.3. Installing the software

The Checkmk server needs a Linux system on which it can run (of course you can also easily monitor Windows and other operating systems). If you do not wish to set up your own Linux server, you can also operate Checkmk with the help of Docker or an appliance. There are four options in total:

Option 1: Installation on a Linux Server

The installation of Checkmk on a Linux server, whether on a ‘real’ or on a virtual machine is – so to speak – the ‘normal’ method. If you have basic Linux knowledge, this method is very simple, and all the software you need is either in your distribution or is included in our package.

We support the following Linux distributions: Red Hat, CentOS, SLES, Debian and Ubuntu. For each edition and version of Checkmk, each of these distributions has its own customized package created by us. You can find these on our download page. You install a package directly with the package manager applicable to your distribution. Please follow the instructions in the Installation on Linux systems article.

Option 2: The virt1 virtual appliance

With the Checkmk virt1 virtual appliance you get a complete, already set-up virtual machine that you can use in VMware, Hyper-V or VirtualBox. Alongside Checkmk it also contains a complete operating system based on Debian GNU/Linux. The advantage of the appliance is that you can configure the operating system completely through the graphical interface. Administering Checkmk is thus possible without in-depth knowledge of Linux. Updating Checkmk and many other operations can also be carried out without using the command line.

The virtual appliance is only available as part of a subscription. If you have booked the virtual appliance option, please go to the appropriate installation guide. The virtual appliance for the Free Edition is available at no cost.

Option 3: The rack1 and rack4 hardware-appliances

If you prefer a physical hardware appliance, you can choose between several models with different support levels. With the hardware appliance you receive a complete system that you can install directly in your data center, with Checkmk already set up and ready to use. Two hardware appliances can be combined into an HA cluster in a few easy steps. The instructions for commissioning the appliances can be found in their own article.

Option 4: Checkmk in a Docker-container

Should you wish to deploy Checkmk using a Docker container you also have this option. We support both the Raw Edition and the Enterprise Edition with finished container images that can be set up in a few simple steps.

Detailed instructions on deploying Checkmk in Docker can be found in their own article.

1.4. Creating an instance

Checkmk has a peculiarity that may appear to be superfluous at first, but one which has proved to be very useful in practice: You can have multiple, independent Checkmk instances (Sites) running in parallel on one server. It is even possible for each instance to run a different version of Checkmk.

Here are two common uses for this feature:

  • Uncomplicated trial and error testing of a new Checkmk version
  • Parallel operation of a test instance to monitor hosts that are not yet in live operation

If you have just installed Checkmk, there are no instances yet. We will show you here how to create an instance during a normal installation of Checkmk.

If you are running Checkmk in Docker, an instance will be created for you automatically. The Checkmk appliances are managed via a web interface which also covers the creation of instances. This is explained in the article about the appliance.

First, select a name for your instance. This may only consist of letters and numbers. The convention is to use lowercase letters. In the manual we use the name mysite for all examples. Always substitute your own instance name wherever you see this name.
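If you script the creation of instances, the naming rule can be checked before omd create is called. The following Python sketch encodes only the rule stated above (letters and numbers, lowercase by convention). The function name check_site_name is hypothetical, and omd itself may apply further restrictions that are not covered here.

```python
import re
from typing import List

# Only the rule stated in this guide: letters and numbers.
# Assumption: omd itself may enforce additional restrictions.
_SITE_NAME = re.compile(r"[A-Za-z0-9]+")


def check_site_name(name: str) -> List[str]:
    """Return a list of problems with a proposed instance name (empty if fine)."""
    problems = []
    if not _SITE_NAME.fullmatch(name):
        problems.append("name may only consist of letters and numbers")
    elif name != name.lower():
        problems.append("convention: use lowercase letters")
    return problems


print(check_site_name("mysite"))   # no problems reported
print(check_site_name("my-site"))  # contains a character that is not allowed
print(check_site_name("MySite"))   # allowed, but breaks the lowercase convention
```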

The creation itself is very easy. Just enter the omd create command as root user, followed by the name of the instance:

root@linux# omd create mysite
Adding /opt/omd/sites/mysite/tmp to /etc/fstab.
Creating temporary filesystem /omd/sites/mysite/tmp...OK
Restarting Apache...OK
Created new site mysite with version 1.6.0.cee.

  The site can be started with omd start mysite.
  The default web UI is available at http://linux/mysite/

  The admin user for the web applications is cmkadmin with password: ZBdHdkl2
  (It can be changed with 'htpasswd -m ~/etc/htpasswd cmkadmin' as site user.)
  Please do a su - mysite for administration of this site.

When creating a new instance, the following actions will take place:

  • A Linux user and a Linux group are created with the name of the instance in the system. The user is called instance user.
  • A data directory for the instance is created under /omd/sites, e.g. /omd/sites/mysite.
  • A meaningful default configuration is copied to the new directory.
  • For the Checkmk web interface a user with the name cmkadmin and a random password will be created.

Note: If you receive the error "Group 'foobar' already existing.", then a Linux user or group with the desired instance name already exists. In this case simply choose another name.

As soon as you have created the new instance, further administration no longer takes place as root, but as the instance user. The easiest way to get here is to use the su - mysite command:

root@linux# su - mysite

The changed prompt shows that you are now ‘logged in’ to the instance. As the pwd command shows, you are then automatically in the data directory for the instance (the instance directory):

OMD[mysite]:~$ pwd
/omd/sites/mysite

As you saw in the output from omd create, creating the instance automatically generates a Checkmk administrative user named cmkadmin. This user is intended for logging in to the Checkmk web interface (GUI), and it receives a random password. As the instance user you can easily change this password:

OMD[mysite]:~$ htpasswd -m etc/htpasswd cmkadmin
New password: *****
Re-type new password: *****
Updating password for user cmkadmin

By the way: Whenever we specify path names in the manual that do not begin with a slash, these are relative to the instance directory. If you are already in this directory, you can use such paths directly. This also applies, for example, to the file etc/htpasswd, whose absolute path here is /omd/sites/mysite/etc/htpasswd, and which contains the passwords for the Checkmk users. Please do not confuse this with /etc/htpasswd!
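The path convention can be written down in a few lines. This is a minimal Python sketch, assuming the /omd/sites layout shown above; the helper name site_path is hypothetical and not part of Checkmk.

```python
from pathlib import PurePosixPath

# Root of the instance data directories, as shown in the omd create output.
SITES_ROOT = PurePosixPath("/omd/sites")


def site_path(site: str, path: str) -> str:
    """Resolve a path written in the manual's style for the given instance."""
    p = PurePosixPath(path)
    if p.is_absolute():
        # A leading slash means a system-wide path, e.g. /etc/htpasswd.
        return str(p)
    # No leading slash: relative to the instance directory.
    return str(SITES_ROOT / site / p)


print(site_path("mysite", "etc/htpasswd"))   # /omd/sites/mysite/etc/htpasswd
print(site_path("mysite", "/etc/htpasswd"))  # /etc/htpasswd
```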

1.5. Starting and stopping instances

An instance can be started or stopped. The ‘startup mode’ is set to automatic, which means that all instances start automatically after a system reboot. Freshly created instances begin their lives stopped, however. You can easily verify this with the omd status command, which shows the status of all of the individual processes required for the operation of the instance:

OMD[mysite]:~$ omd status
mkeventd:       stopped
liveproxyd:     stopped
mknotifyd:      stopped
rrdcached:      stopped
cmc:            stopped
apache:         stopped
dcd:            stopped
crontab:        stopped
-----------------------
Overall state:  stopped

You can start the instance with a simple omd start command:

OMD[mysite]:~$ omd start
Creating temporary filesystem /omd/sites/mysite/tmp...OK
Starting mkeventd...OK
Starting liveproxyd...OK
Starting mknotifyd...OK
Starting rrdcached...OK
Starting cmc...OK
Starting apache...OK
Starting dcd...OK
Initializing Crontab...OK

As expected, the status following this shows all services as running:

OMD[mysite]:~$ omd status
mkeventd:       running
liveproxyd:     running
mknotifyd:      running
rrdcached:      running
cmc:            running
apache:         running
dcd:            running
crontab:        running
-----------------------
Overall state:  running

Because the Raw Edition does not have all the features of the Enterprise Edition, you will see fewer services there. In addition, cmc is replaced by nagios:

OMD[mysite]:~$ omd status
mkeventd:       running
rrdcached:      running
npcd:           running
nagios:         running
apache:         running
crontab:        running
-----------------------
Overall state:  running

The omd command has many more options for controlling and configuring instances. All details on these can be found in the corresponding articles covering instances.

There is also a specific article covering more detail on the directory structure of the instance and the options for the command line in Checkmk.
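If you want to evaluate the omd status output in a script, its line-oriented layout is easy to parse. The following Python sketch relies only on the format of the sample output shown above (one "name: state" line per process, a dashed separator, and an "Overall state" line). The function name parse_omd_status is hypothetical, and other output variants are not handled.

```python
from typing import Dict, Tuple


def parse_omd_status(text: str) -> Tuple[Dict[str, str], str]:
    """Split omd status output into per-process states and the overall state."""
    services: Dict[str, str] = {}
    overall = ""
    for line in text.splitlines():
        if ":" not in line:
            continue  # skips the dashed separator and empty lines
        name, _, state = line.partition(":")
        name, state = name.strip(), state.strip()
        if name == "Overall state":
            overall = state
        else:
            services[name] = state
    return services, overall


# Sample copied from the output of a freshly created instance above.
SAMPLE = """\
mkeventd:       stopped
liveproxyd:     stopped
mknotifyd:      stopped
rrdcached:      stopped
cmc:            stopped
apache:         stopped
dcd:            stopped
crontab:        stopped
-----------------------
Overall state:  stopped
"""

services, overall = parse_omd_status(SAMPLE)
not_running = [name for name, state in services.items() if state != "running"]
print(overall, len(not_running))
```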

1.6. Logging-in to the instance

Once the instance is running it can be used. Every instance has its own URL which you can open in your browser. This URL is composed of the IP address or hostname of your monitoring server, a slash, and the name of the instance – for example, http://mycmkserver/mysite/. There you will find the following login window:

If your instance has not started, you will see the following error message:

If there is no instance with this name (or you have landed on a server without Checkmk), it will look like this:

Now log in with the user cmkadmin and the initial, randomly-generated password, or respectively your new, updated password. This will land you on Checkmk's homepage:

Important: As soon as you are operating Checkmk in a production environment, we recommend for security reasons that you access the interface exclusively via HTTPS. How to do this is explained in its own article.
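The URL scheme described in this section is simple enough to generate in scripts or documentation. A minimal Python sketch follows; the function name site_url is hypothetical, and the https switch only changes the scheme, it does not configure HTTPS on the server.

```python
def site_url(server: str, site: str, https: bool = False) -> str:
    """Build the web interface URL: server, a slash, and the instance name."""
    scheme = "https" if https else "http"
    return f"{scheme}://{server}/{site}/"


print(site_url("mycmkserver", "mysite"))              # http://mycmkserver/mysite/
print(site_url("mycmkserver", "mysite", https=True))  # https://mycmkserver/mysite/
```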

1.7. The first overview of the interface

You will see quite a number of elements in the interface which we do not need at this time. Many of these elements are empty, or in any case display only zeros because we have not yet included objects in the monitoring configuration.

Nevertheless, you should first familiarize yourself with the basic elements of the interface. Most important is the division into the Sidebar on the left and the main area on the right. Of course, what you see in the main section depends on where you are in Checkmk right now. After logging in you first start in the default dashboard, which shows a rough overview of the current state and the recent events in monitored objects.

The Sidebar

More important is the sidebar. Here you will find a number of elements, also referred to as snap-ins. Depending on the size of your screen, not all snap-ins will be visible. But how do you scroll the sidebar when it has no scroll bars? There are two options:

  1. Simply roll the mouse wheel up and down while the mouse pointer is over the sidebar. For touchpads, this feature is often possible with the ‘two fingers up and down’ gesture.
  2. With the mouse just ‘grab’ one of the snap-ins outside of its title bar and move it up or down.

In the default setting (of course, the sidebar is customizable!) you will find the following elements:

  • The Tactical Overview – an overview of all monitored objects
  • The Quicksearch – Search box
  • Views – The directory of various status views
  • Reporting – Create PDF reports
  • Bookmarks – Your personal bookmarks within Checkmk
  • WATO – Configuration – The most important element: for the configuration of monitoring
  • The Master Control – various main switches for the monitoring service

At the top of the sidebar you will find the Checkmk edition and version identification, as well as the Checkmk logo. A click on the logo will always bring you to Checkmk's home dashboard.

At the bottom of the sidebar you will find the icon that takes you to your personal settings, where you can change your password, and finally the icon that logs you out of the interface.

2. Setting up monitoring

2.1. Hosts and services, agents

So, Checkmk is now ready. But before we start with the actual monitoring, we should briefly explain some important terms. We will begin with the host. In Checkmk a host is typically a server, a VM, a network device, an appliance, or anything else with an IP address which is being monitored by Checkmk. Every host always has one of the states UP, DOWN or UNREACH. There are also hosts without an IP address, such as Docker containers.

On each host a number of services are monitored. A service can be anything – for example, a file system, a process, a hardware sensor, a switchport – but it can also just be a specific metric like CPU usage or RAM usage. Each service has one of the states OK, WARN, CRIT or UNKNOWN.
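The two sets of states can be modelled as enumerations. In the following Python sketch the state names come from the text above; the numeric values follow the Nagios convention and are an addition of this sketch, not something this guide specifies. The service names are purely illustrative.

```python
from collections import Counter
from enum import IntEnum


class HostState(IntEnum):
    UP = 0
    DOWN = 1
    UNREACH = 2


class ServiceState(IntEnum):
    OK = 0
    WARN = 1
    CRIT = 2
    UNKNOWN = 3


# Hypothetical services of one host, for illustration only.
host_state = HostState.UP
services = {
    "Filesystem /": ServiceState.OK,
    "CPU load": ServiceState.WARN,
    "Memory": ServiceState.OK,
}

# Count how many services are in each state.
summary = Counter(state.name for state in services.values())
print(host_state.name, dict(summary))
```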

In order for Checkmk to be able to query data from a host, an agent is usually necessary. This is a small program installed on the host which provides ‘health’ information about the host on request. The manufacturers of network devices and many appliances usually include a pre-installed agent which Checkmk can easily query using the standardized SNMP protocol. Cloud services like AWS or Azure also have features similar to agents, but they are called ‘APIs’ and are queried by Checkmk via HTTP. Servers running Windows, Linux or Unix can only be monitored by Checkmk if you install one of our Checkmk agents.

2.2. Considerations relating to DNS

Even though Checkmk does not require name resolution for hosts, a well-maintained DNS is a great help with configuration and in avoiding mistakes. Checkmk can then resolve the host names itself, so that you do not have to enter any IP addresses manually.

The implementation of monitoring is therefore a good opportunity to bring your DNS up to date and to add any missing entries!

2.3. Host folder structures

Checkmk manages your hosts in a hierarchical tree of folders – quite analogously to the way you see files in your operating system. If you only have a handful of hosts to monitor, that may not be that important to you – but remember – Checkmk has been designed for monitoring thousands and tens of thousands of hosts. And then good organisation is half the battle won!

So, before you add your first hosts to Checkmk, it is a good idea to give some thought to the structure of these folders. The structure is not only useful for your own overview: folders are also the means by which you define configuration attributes for all of the hosts they contain. These attributes are automatically inherited by any subfolders and hosts in the folder.

You can of course change the folder structure at any time. You must, however, proceed very carefully, since moving a host to another folder may alter its attributes without your being aware of it.
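The inheritance of attributes, and the reason why moving a host deserves care, can be illustrated with a small model. This Python sketch is not Checkmk's implementation. The classes and the attribute keys (agent, snmp_community) are hypothetical and only demonstrate the principle that the nearest definition along the folder path wins.

```python
from typing import Dict, Optional


class Folder:
    def __init__(self, name: str, parent: Optional["Folder"] = None,
                 attributes: Optional[Dict[str, str]] = None):
        self.name = name
        self.parent = parent
        self.attributes = dict(attributes or {})

    def effective_attributes(self) -> Dict[str, str]:
        # Start with what the parent folders define, then apply own settings.
        inherited = self.parent.effective_attributes() if self.parent else {}
        inherited.update(self.attributes)
        return inherited


class Host:
    def __init__(self, name: str, folder: Folder,
                 attributes: Optional[Dict[str, str]] = None):
        self.name = name
        self.folder = folder
        self.attributes = dict(attributes or {})

    def effective_attributes(self) -> Dict[str, str]:
        result = self.folder.effective_attributes()
        result.update(self.attributes)  # explicit host settings win
        return result


main = Folder("Main directory")
network = Folder("Network", main, {"agent": "no-agent", "snmp_community": "public"})
linux = Folder("Linux", main, {"agent": "cmk-agent"})

host = Host("switch01", network)
before = host.effective_attributes()

host.folder = linux  # moving the host silently changes what it inherits
after = host.effective_attributes()

print(before)
print(after)
```

In this model the host loses its SNMP community and switches agent type simply by being moved, without any change to the host itself.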

The real question when building a meaningful folder structure is which criteria you want to use to organize the folders. These can differ at each level of the tree. For example, you can order by location at the first level, and by technology at the second level below that.

The following classification criteria have proven themselves well in practice:

  • Location/Geography
  • Organization
  • Technology

Sorting by location is obviously used mostly by larger companies, especially if the monitoring is distributed over multiple Checkmk servers. Each server then monitors a region or a country, for example. If your folders map such divisions, for example, in the folder ‘Munich’, you can define that all hosts in this folder be monitored by the muc instance in Checkmk.

Alternatively, the question of organization – that is, who is ‘responsible’ for a host – can be a more meaningful criterion, because location and responsibility are not always the same. It may be that one group of your colleagues is responsible for the administration of Oracle, regardless of where the respective hosts are located. So, if the Oracle folder is provided for the Oracle colleagues' hosts for example, it is then easy to configure that all the hosts within this folder are visible only to these colleagues and that they can even take care of their own hosts there.

Structuring according to technology could, for example, provide a folder for Windows servers, and one for Linux servers. This in turn simplifies configuration according to the formula ‘the process sshd must be running on all Linux servers’. Another example is the monitoring of devices such as switches or routers via SNMP. Here no agent is used, but the devices are queried via the SNMP protocol. When these hosts are in folders you can make necessary SNMP settings – such as community – directly in the folder.

Of course, a tree structure does not reflect the whole complexity of reality – with the host properties (tags) Checkmk provides another structure option that intelligently complements the trees. But more about this later. Further information on structuring the folders can be found in the reference section.

2.4. Creating folders

The function for the administration of folders and hosts can be found in the WATO ➳ Hosts module, which you can reach via the WATO – Configuration sidebar element:

One folder – the root folder – is present in a freshly-installed Checkmk system. This is named Main Directory by default, but if you don't like this name, you can easily rename it by using the Folder properties button. You can create new hosts directly here, but it is better if you first create some suitable subfolders.

For our beginner's manual we will use a simple example – the three folders Windows, Linux and Network. Create these three folders by clicking the New folder button and in the first menu titled General properties, enter each folder's respective name:

Tip: If you are too lazy to scroll to the Save & Finish button, just press enter while the cursor is still in the text input field. That also performs a save, and exits the form.

After that the situation will look like this:

Tip: In many windows (as seen here when creating a new folder) you will see a small icon of a book in the upper right corner. With this you can turn online help on and off. The help explains the individual input fields.

2.5. Adding the first hosts

Now we are ready to add the first host to the system. And what could be more obvious to monitor than the Checkmk server itself? Of course, the server will never be able to report its own total failure, but monitoring it is still useful, since you will not just get an overview of the CPU and RAM usage, but also quite a few metrics and checks about the Checkmk system itself.

The procedure for adding a Linux or Windows host is always the same:

  1. Download the Checkmk agent
  2. Install the Checkmk agent on the destination host
  3. With WATO add the host into a suitable folder
  4. Perform a service configuration
  5. Activate the changes

Downloading the Checkmk Agent

Because the Checkmk server is a Linux machine, you need the Checkmk agent for Linux. You can find this directly in the interface under WATO ➳ Monitoring Agents.

In the Enterprise Edition this takes you to the Agent Bakery, which allows the ‘baking’ of individually configured agent packages. A generic agent is always available there, however, without you needing to do anything:

Choose RPM format for Red Hat, CentOS, or SLES, and DEB format for Debian and Ubuntu. Download the file and copy it to the Checkmk server.

The Raw Edition does not have an agent bakery. Clicking WATO ➳ Monitoring Agents takes you directly to a download page on which you can find preconfigured agents and agent plug-ins. (In the Enterprise Edition this same page can be found under Agent files.)

From the first box, Packaged Agents, select one of the two Linux packages (RPM/DEB) and copy it to the Checkmk server.
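The choice of package format can be expressed as a simple lookup. The following Python sketch contains only the distributions named in this guide; the function name agent_package_format is hypothetical, and other distributions are deliberately rejected rather than guessed.

```python
# Package formats as stated above: RPM for Red Hat, CentOS and SLES,
# DEB for Debian and Ubuntu.
PACKAGE_FORMAT = {
    "red hat": "rpm",
    "centos": "rpm",
    "sles": "rpm",
    "debian": "deb",
    "ubuntu": "deb",
}


def agent_package_format(distribution: str) -> str:
    """Return 'rpm' or 'deb' for one of the distributions named in the guide."""
    key = distribution.strip().lower()
    if key not in PACKAGE_FORMAT:
        raise ValueError(f"no packaged agent format known for {distribution!r}")
    return PACKAGE_FORMAT[key]


print(agent_package_format("CentOS"))  # rpm
print(agent_package_format("Ubuntu"))  # deb
```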

Installing the Checkmk agent on the destination host

In the example below, assume that you put the file in the /root directory – i.e. in the home directory of the root user. This file is only needed during installation – you can delete it later.

The installation is done as root on the command line, either with rpm, preferably with the option -U ...

root@linux# rpm -U check-mk-agent-1.6.0-3a83e51d5c12619c.noarch.rpm

... or, for DEB, with the dpkg -i command:

root@linux# dpkg -i check-mk-agent_1.6.0-3a83e51d5c12619c_all.deb

Important: In order to function, the agent requires either systemd – which in newer distributions is the default – or the auxiliary xinetd. What the situation is in your case can be easily seen in the output when installing the agent:

Agent running ...     Output
with xinetd           Reloading xinetd...
with systemd          Enable Check_MK_Agent in systemd...
agent not running     Neither of the two messages above appears, but instead: This package needs xinetd to be installed for full functionality.

If you have neither systemd nor xinetd, simply install xinetd. On Red Hat/CentOS this is done with:

root@linux# yum install xinetd

On SLES the command is:

root@linux# zypper install xinetd

And on Debian/Ubuntu:

root@linux# apt install xinetd

Testing the Checkmk agent

Incidentally, the Checkmk agent for Linux is an executable program (shell script) which you can easily test by calling the check_mk_agent command:

root@linux# check_mk_agent
<<<check_mk>>>
Version: 1.6.0
AgentOS: linux
Hostname: linux
AgentDirectory: /etc/check_mk
DataDirectory: /var/lib/check_mk_agent
SpoolDirectory: /var/lib/check_mk_agent/spool
PluginsDirectory: /usr/lib/check_mk_agent/plugins
LocalDirectory: /usr/lib/check_mk_agent/local
...

To test whether the agent is reachable from outside, you can use telnet from another system to connect to port 6556. The agent should respond with the same information:

root@linux# telnet mycmkserver 6556
Trying 192.168.56.100...
Connected to mycmkserver.example.net.
Escape character is '^]'.
<<<check_mk>>>
Version: 1.6.0
AgentOS: linux
Hostname: linux
...
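The telnet test can also be reproduced in a script. The following Python sketch is an illustration, not Checkmk's own code. It assumes that the agent sends its data on port 6556 and then closes the connection, as the telnet session above suggests, and that sections are introduced by header lines of the form <<<name>>>. The function names are hypothetical.

```python
import socket
from typing import Dict, List


def fetch_agent_output(host: str, port: int = 6556, timeout: float = 10.0) -> str:
    """Read everything the agent sends after connecting, as telnet does above."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break  # assumption: the agent closes the connection when done
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_sections(output: str) -> Dict[str, List[str]]:
    """Group agent output lines under their <<<section>>> headers."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in output.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3]
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


# Sample taken from the agent output shown above.
SAMPLE = """\
<<<check_mk>>>
Version: 1.6.0
AgentOS: linux
Hostname: linux
"""

sections = parse_sections(SAMPLE)
print(sections["check_mk"][0])  # Version: 1.6.0
```

Against a real host you would call parse_sections(fetch_agent_output("mycmkserver")), which requires that port 6556 is reachable from the machine running the script.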

Note: By default the agent is reachable from the entire network and can be queried without a password. Because the agent does not accept any commands from the network, a potential attacker cannot use it to gain access. Information such as the list of current processes is visible, however. How to protect the agent is explained in the article about the Linux agent.

Add the host to a suitable folder with WATO

After the agent has been installed on the destination host you can start monitoring it. In our example that is the Checkmk server itself, but that does not really make a difference.

Now go back to the WATO ➳ Hosts module and there switch to the Linux folder by simply clicking on the folder's graphic. Click on New host.

There you will find a form with several boxes and many input options. As mentioned at the beginning, Checkmk is a complex system which has an answer to every question. That is why you can perform a lot of configuration for a host.

The good news is that you only have to fill in one field, namely the Host name field in Basic Settings. You can choose this name freely. It serves as the key for the host everywhere in the monitoring and must therefore be unique:

If the host is resolvable under its own name in DNS, you are already finished with this form. If not, or if you do not want to use DNS, you can enter the address by hand in the IPv4 address field:

Note: So that Checkmk can always run stably and efficiently, it maintains its own cache for hostname resolution. Thus a failure of the DNS service does not cause a failure of the monitoring system. The DNS query is performed only once – when the host is added to the system.

This cache is automatically renewed every day at 00:05. By clicking on the Update DNS cache button in the Host Properties window of one of your hosts, you can rebuild the entire DNS cache manually. Do this if you want a change in your DNS to take effect immediately.
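The caching behaviour described above can be sketched as follows. This Python example is a simplified model and not Checkmk's implementation. The class name is hypothetical, the resolver is injectable so that the example runs without a real DNS, and keeping the old address when a refresh fails is a design choice of this sketch.

```python
import socket
from typing import Callable, Dict


class HostAddressCache:
    def __init__(self, resolver: Callable[[str], str] = socket.gethostbyname):
        self._resolver = resolver
        self._cache: Dict[str, str] = {}

    def lookup(self, hostname: str) -> str:
        # DNS is queried only the first time a host is seen.
        if hostname not in self._cache:
            self._cache[hostname] = self._resolver(hostname)
        return self._cache[hostname]

    def refresh(self) -> None:
        """Rebuild the cache, e.g. once a day or on manual request."""
        for hostname in list(self._cache):
            try:
                self._cache[hostname] = self._resolver(hostname)
            except (OSError, KeyError):
                pass  # resolution failed: keep the last known address


# A dictionary stands in for the DNS server in this example.
dns = {"mycmkserver": "192.168.56.100"}
cache = HostAddressCache(resolver=dns.__getitem__)

first = cache.lookup("mycmkserver")
dns["mycmkserver"] = "192.168.56.101"  # the DNS entry changes
stale = cache.lookup("mycmkserver")    # still served from the cache
cache.refresh()
fresh = cache.lookup("mycmkserver")    # the change is visible after a refresh

print(first, stale, fresh)
```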

You can find detailed information about name resolution during monitoring in the article covering host administration.

2.6. Diagnostics

Everything that can go wrong eventually will go wrong – and, of course, especially when you are doing things for the first time! That is why good fault diagnosis options are so important. One of these options can be found in WATO if you click Save & Test in the host's properties. Alternatively, you can reach the same diagnostic page at any time via the Diagnostic button in the Host Properties – in this case without first needing to save.

Scroll down the diagnostics page and press Test. Now Checkmk will try to reach the host in all possible ways. For Windows and Linux hosts only the two upper boxes are interesting:

The other boxes attempt to make contact via SNMP. They are very useful for network devices, which we will discuss below.

In the Host properties box on the diagnostics page you can, if necessary, try a different IP address, and even transfer this IP address directly into the host properties with Save & Exit.

2.7. Configuring services

Once the host itself has been added, we come to the really interesting part: the configuration of services. This can be achieved in a number of ways:

  • by saving the host properties with Save & go to Services
  • by clicking the icon in the folder view of a host
  • by clicking on the Services button in the host properties, or at the top of any other page for the host

On this page you specify which services you wish to monitor on the host. If the agent is running correctly on the host and is reachable, Checkmk automatically finds a set of services and suggests these to be monitored (abbreviated here):

For each of these services there are in principle three possibilities:

  • Undecided: You have not yet decided whether you want to monitor this service.
  • Monitored: The service is being monitored.
  • Disabled: You have chosen not to monitor the service.

In the beginning all services start as undecided. For starters, it is easiest if you now click Fix all missing/vanished – all services will then be transferred directly to the configuration.

You can call up this view at any time later to configure its services. Sometimes new services are the result of changes to a host, e.g. if you include a LUN as a file system, or configure a new instance of Oracle. These services first appear as undecided, and you can then add them one at a time or all at once into the monitoring configuration.

Conversely, services may disappear, e.g. because a file system has been removed. These services then appear in the monitoring as UNKNOWN, and in the configuration page as vanished. You can remove these from the monitoring here.

The Fix all missing/vanished button performs both of these functions at once – adding missing services, and removing unnecessary ones.
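The relationship between discovered, monitored and vanished services comes down to set operations. The following Python sketch models this logic in a simplified way; it is not Checkmk's code, and the function names and service names are hypothetical.

```python
from typing import Dict, Iterable, Set


def classify_services(discovered: Iterable[str], monitored: Iterable[str],
                      disabled: Iterable[str] = ()) -> Dict[str, Set[str]]:
    """Sort services into the categories shown on the services page."""
    found = set(discovered) - set(disabled)  # disabled services are ignored
    configured = set(monitored)
    return {
        "undecided": found - configured,   # found on the host, not yet monitored
        "monitored": found & configured,   # found and monitored
        "vanished": configured - found,    # monitored, but no longer found
    }


def fix_all(discovered: Iterable[str], monitored: Iterable[str],
            disabled: Iterable[str] = ()) -> Set[str]:
    """Add all missing services and remove all vanished ones in one step."""
    states = classify_services(discovered, monitored, disabled)
    return (set(monitored) | states["undecided"]) - states["vanished"]


discovered = {"CPU load", "Filesystem /", "Filesystem /data"}
monitored = {"CPU load", "Filesystem /", "Filesystem /old"}

states = classify_services(discovered, monitored)
result = fix_all(discovered, monitored)

print(sorted(states["undecided"]))  # ['Filesystem /data']
print(sorted(states["vanished"]))   # ['Filesystem /old']
print(sorted(result))
```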

2.8. Activating changes

WATO is designed so that any changes you make initially take effect only in a preliminary ‘configuration environment’, so that the current production operation is not yet affected. Only when you activate the changes (Activate changes) are they transferred into production monitoring. Learn more about the background for this in the article about WATO.

Now click on the button to apply the changes. This brings you to a new page that lists, among other things, the changes that have not yet been activated under Pending changes:

Now click on the Activate affected button to apply all changes. Shortly afterwards in the Tactical Overview sidebar you will see how the host and its services appear there. Also in the main dashboard that you reach by clicking on the Checkmk logo at the top left corner, you will now be able to see that the monitoring system has been brought to life.
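The separation between the configuration environment and production monitoring can be pictured as two copies of the configuration plus a list of pending changes. The following Python sketch only illustrates this principle. The class and its methods are hypothetical and do not reflect how WATO stores its data.

```python
import copy
from typing import Dict, List


class StagedConfiguration:
    def __init__(self) -> None:
        self.production: Dict[str, dict] = {}  # what the running monitoring uses
        self.draft: Dict[str, dict] = {}       # what you edit in the configuration
        self.pending: List[str] = []           # changes not yet activated

    def add_host(self, name: str, folder: str) -> None:
        self.draft[name] = {"folder": folder}
        self.pending.append(f"Created new host {name}")

    def activate(self) -> List[str]:
        """Transfer the draft into production and clear the pending list."""
        self.production = copy.deepcopy(self.draft)
        activated, self.pending = self.pending, []
        return activated


config = StagedConfiguration()
config.add_host("mycmkserver", "Linux")

visible_before = "mycmkserver" in config.production  # not yet monitored
activated = config.activate()
visible_after = "mycmkserver" in config.production   # now in production

print(visible_before, activated, visible_after)
```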

2.9. Monitoring Windows

As with Linux, Checkmk also has its own Windows agent. This is provided as an MSI package. You can find it at the same location as the Linux agent. Once you have copied the MSI package to your Windows machine you can install it with the usual Windows double-click.

Note: You may need to adjust the firewall settings on Windows so that Checkmk can reach the agent over the network.

Once the agent has been installed you can add the host to the monitoring setup. This works in the same way as seen above with the Linux host. Because Windows is structured differently from Linux, the agent, however, finds other services to monitor. More details about monitoring Windows can be found in its own article.

2.10. Monitoring via SNMP

Professional switches, routers, printers and many other devices and appliances already have a built-in interface for monitoring provided by their manufacturer: the Simple Network Management Protocol (SNMP). Such devices are very easy to monitor with Checkmk – and you do not even have to install an agent.

The basic procedure is always the same:

  1. Using the device's management interface, enable SNMP read access for the Checkmk server's IP address.
  2. Assign a community string. This is nothing more than a password for access. Since it is usually transmitted in plain text over the network, there is little point in making it very complicated. Most users simply use the same community string for all devices within a company. This also greatly simplifies the configuration in Checkmk.
  3. Create the host as usual in Checkmk.
  4. In the host's properties in the Data sources box, set Check_MK Agent to No agent.
  5. In the same box activate SNMP, and select SNMP v2 or v3.
  6. If the community string is not public, enable SNMP credentials ➳ SNMP community (SNMP Versions 1 and 2c) and enter the community string here.

If you have all SNMP devices in their own folder, simply configure the Data sources directly on the folder – the settings will then automatically apply for all hosts in the folder!
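
The inheritance from folder to host can be sketched as a simple merge in which the more specific setting wins. This Python example is only an illustration under assumptions: the attribute keys and values are hypothetical and do not reflect how Checkmk stores its configuration.

```python
def effective_attributes(folder_chain: list[dict], host_attrs: dict) -> dict:
    """Merge attributes from the outermost folder down to the host itself."""
    result: dict = {}
    for folder_attrs in folder_chain:  # outermost folder first
        result.update(folder_attrs)
    result.update(host_attrs)          # explicit host settings win
    return result


# Hypothetical attribute names, for illustration only
main = {"agent": "check_mk_agent"}
snmp_devices = {"agent": "no_agent", "snmp": "v2c", "snmp_community": "public"}

switch01 = effective_attributes([main, snmp_devices], {})
switch02 = effective_attributes([main, snmp_devices], {"snmp_community": "s3cret"})
print(switch01)
print(switch02)
```

Here switch01 takes everything from its folder, while switch02 overrides only the community string.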

The rest is as usual. If you want, you can take another look at the diagnostics page, where you will see immediately whether access via SNMP works. Here is an example for a Cisco Catalyst 4500 switch:

Then click Save & go to Services again to see the list of all services. Of course, this looks completely different from the list for Windows or Linux. For all devices Checkmk by default monitors all ports that are currently in use. You can of course adjust this later as desired. Checkmk also shows the general information for the device in a service that is always OK, as well as the device's uptime.

2.11. Clouds, Containers and VMs

You can also easily monitor cloud and container services with Checkmk, even if you do not have access to the actual server. For this Checkmk uses the providers' APIs. These APIs use HTTP or HTTPS. The basic principle is always the same:

  1. You set up an account for Checkmk in the provider's management interface.
  2. In Checkmk you create a host to access the API.
  3. For this host you create a configuration to access the API.
  4. For the monitored objects, such as VMs, EC2 instances, containers, etc., create or automate additional hosts in Checkmk.

There are step-by-step instructions in the manual for all of these:

3. The User Interface

3.1. The Status Interface

Now that we finally have something for our monitoring system to do, it makes sense to have a closer look at the interface. Above all we are interested in the things relevant to operations – the everyday life of a monitoring system, so to speak. In Checkmk this component is also sometimes referred to as the status interface, because it is mostly about seeing the current status of all hosts and services.

3.2. The Tactical Overview

Let's take a closer look at the Tactical Overview:

In the left column of this small table you will first see the number of monitored hosts and services. The third line shows Events. These will only become relevant for you if you have configured a monitor for messages – here we mean messages from syslog, SNMP traps and logfiles, for example. For this Checkmk has its own very powerful module, the Event Console, which will not be discussed in this beginner's guide.

The second column shows the problems. These are the monitored objects which have the status WARN/CRIT/UNKNOWN, or DOWN/UNREACH. You can click on the number in the cell and be linked directly to the objects that are counted here.

The third column can never be bigger than the second one, because it shows those problems that are still unacknowledged. An acknowledgment is a kind of ‘recognition’ of problems, a subject which we will discuss later.

The last column shows objects that are currently stale. These are hosts or services that currently have no up-to-date monitoring data available. If a host is currently not available, Checkmk of course can have no information about its services. That does not necessarily mean that there is a problem with them. That is why Checkmk does not just assume a new status for these services, instead it flags them with the pseudostate stale. The Stale column will be missing if all other fields show a 0 (zero).

3.3. Bookmarks

For pages you visit regularly you can create bookmarks with the Bookmarks snap-in:

But why do you need these bookmarks? After all, there are also bookmarks in the browser! Well, the Checkmk bookmarks have a few advantages:

  • You only change the content on the right side without reloading the sidebar.
  • You can share bookmarks with other users.
  • Setting bookmarks automatically prevents the repetition of actions.

The Checkmk bookmarks are organized in lists. Such a list is a collection of bookmarks that you can manage as a whole. You can decide for each list whether it is shared with other users or stays private for your own use.

Besides, each bookmark has a topic – this is the folder under which the bookmark is saved in the sidebar.

Important: A list can sort bookmarks into different topics! Or vice versa – a topic can also contain bookmarks from different lists.

To start with, the snap-in for the bookmarks is still empty:

If you click Add Bookmark, a new bookmark will be generated from what is currently displayed in the main view, and this new bookmark will be automatically saved in the (Topic) My bookmarks folder.

If you want to look deeper into the subject of bookmarks, you can find more details in the GUI Reference.

3.4. Quicksearch

The Quicksearch element searches for hosts and services in the status interface (not in WATO!). It is very interactive. Once you've typed something, you immediately see auto-completion suggestions. Here are a few tips:

  • The search is not case-sensitive.
  • You do not have to select an entry from the suggestion list. Just press Enter to find a view of all the hosts or services that match the search expression.
  • You can save the result of the search in a bookmark.
  • If you want to search for host and service patterns, you can work with h: and s: in combination. A search for h:win s:cpu will show you all the services that contain cpu on all hosts that contain win.
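
The combination of h: and s: can be pictured as two filters that must both match. The following Python sketch is a simplified model of that behaviour and not Checkmk's search code: it uses plain, case-insensitive substring matching, and the function names and example data are assumptions.

```python
def parse_query(query: str) -> dict[str, str]:
    """Split 'h:win s:cpu' into {'h': 'win', 's': 'cpu'}."""
    filters = {}
    for token in query.split():
        prefix, _, pattern = token.partition(":")
        if prefix in ("h", "s") and pattern:
            filters[prefix] = pattern.lower()
    return filters


def search(query: str, services: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the (host, service) pairs that match every filter in the query."""
    filters = parse_query(query)
    return [
        (host, service)
        for host, service in services
        if filters.get("h", "") in host.lower()
        and filters.get("s", "") in service.lower()
    ]


# Hypothetical example data
services = [
    ("winsrv01", "CPU utilization"),
    ("winsrv01", "Memory"),
    ("linux07", "CPU load"),
]
print(search("h:win s:cpu", services))
```

Only the CPU service on winsrv01 satisfies both filters; a missing filter matches everything.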

3.5. The Master Control

In the Master control element you can turn various functions of the monitoring system on and off individually – such as the alerting (Notifications) for example. This is very useful if you are making major alterations on the system and want to avoid annoying your colleagues with useless messages.

Please make sure that all switches are set back to on during normal operation, otherwise important monitoring functions may remain switched off!

3.6. Customizing the sidebar

Each element can be removed from the sidebar or collapsed. You have two icons in the upper right corner of each element. Clicking on the cross removes the element. A click on the small dash collapses the element. When an element is collapsed, the small dash changes to a square. If you click on the square, the element will unfold again.

You will find the icon on the far left at the bottom of the sidebar. With this you can extend the sidebar with additional snap-ins. Clicking on the icon will show you all available elements, which you can then simply click on to add. Note that these appear at the bottom and you may need to scroll down the bar to see them.

The order of snap-ins in the sidebar can be changed easily with the mouse. Click with the left mouse button on the upper edge of the snap-in, hold the mouse button down and move the snap-in to the desired position.

If you want to hide the sidebar in order to enlarge another window, all you have to do is move the mouse pointer to the very left of the sidebar's frame and click to collapse the sidebar – you will then only see a black vertical line. If you later click on this, you can unfold the sidebar again.

3.7. Views

The Views Snap-in

The most important snap-in for operations, besides the Tactical Overview, is the one titled Views. A view is a status display that shows you the current state of hosts or services (and sometimes other objects).

Such a view may have a context, e.g. when it contains all services of the host myhost012. Other views operate globally, e.g. the one that shows you all of the services that currently have a problem.

All of these global views are accessible through the Views snap-in. The views are grouped into Topics (folders) which can be opened and closed individually:

Navigating in Views

You have numerous options in the status views:

  • You can navigate to other views by clicking certain cells (here, for example, the host name or the number of its services in the WARN state).
  • By clicking on a column title you can sort by this column.
  • Click on the corresponding icon to see a whole series of other buttons that will take you to related views.
  • A further button opens a series of search fields which you can use to filter the objects shown.
  • Another button allows you to change the number of columns displayed (to take full advantage of your wide screen). You can also change this with the mouse wheel when the pointer is over this button.
  • With a further button you set the number of seconds after which the view is automatically refreshed (after all, status data can change at any time).

The views have many more options, so that you can customize the views, and even build your own views. You can find out how to do that in a separate article.

3.8. Metrics

The vast majority of services not only provide a condition, but also measured values. As an example, take the service which checks the file system C: on a Windows server:

In addition to the status OK, the output shows that 68.67 GB of the file system's total capacity of 135.78 GB is used, equivalent to 50.57%. The details are shown in the text section of the status output. The most important of these values – the percentage – is also visualized on the right side in the Perf-O-Meter column.

But this is just a rough overview. A detailed table of all measured values for a service can be found in its detail view in the Service Metrics line:

Even more interesting, however, is that Checkmk automatically stores the time line of all such readings for up to four years (this is of course customizable). Within the first 48 hours, the values are stored to the minute. Time lines are displayed in graphs like this, as they are shown in the Checkmk Enterprise Editions:

Here are a few tips on what you can do with these graphs:

  • If you move your mouse over a reading, a small pop-up opens with the exact values for that time.
  • Click and hold anywhere in the data area of the graph, then move the mouse left or right to adjust the time range.
  • While still holding down the mouse button, slide up and down to scale the graphs vertically.
  • With the mouse wheel you can zoom in and out in the timeline.
  • You can resize the graph by dragging its lower right corner.

In the Checkmk Raw Edition there is also a system for displaying graphs. This is based on PNP4Nagios and is not interactive.

The system for recording, evaluating and displaying measured data in Checkmk can do much more – especially in the Checkmk Enterprise Editions. Details can be found in its own article.

4. Checkmk in Operation

4.1. Important Functions in Operation

We have included hosts in the configuration, and we have looked at the operation of the status interface. Now we can start with the actual monitoring. It's important to bear in mind that the purpose of Checkmk is not to constantly occupy staff with its own configuration, but to support an IT department.

Now the different status views show you exactly how many and what problems there are. However, to map workflows and to ‘work’ properly with the monitoring we need something more: acknowledging problems, scheduling downtimes, and alerting.

In this chapter, we will start with only the first two elements. The alerting will be handled separately later – with good reason, as we will see.

4.2. Acknowledging Problems

In the Tactical Overview we have already seen that problems can be either unhandled or handled. An Acknowledgment is the action that changes an unhandled problem into a handled one. That does not necessarily mean that someone is actually taking care of the problem. Some problems even disappear by themselves. But acknowledging them helps you to keep track and to establish workflows.

What exactly happens when you acknowledge a problem?

  • The host/service will no longer be listed in the third column in the Tactical Overview.
  • The default dashboard also does not list the problem.
  • The object is marked with an icon in status views.
  • By acknowledging, an entry is made in the object history so that you can follow it up later.
  • Repeating alarms (if configured) are stopped by acknowledgments.

Acknowledging individual problems

So, how do you acknowledge a problem? First open it in a status view. There are two ways of acknowledging. The first way is best if you just want to acknowledge a single problem. To do this, click through to the details of the host/service, that is, to the view titled ...

  • Status of host myhost123 in the case of a host
  • Service myhost123, FOO Service in the case of a service

Now click on the symbol at the top. This will open a number of input fields through which you can take numerous actions on the displayed host/service. The field you need is the one at the top:

Enter a comment here and click on Acknowledge – and after the obligatory “Are you sure?” question...

... the problem will be considered as acknowledged. Here are some hints:

  • You can also remove an acknowledgment with the Remove acknowledgment button.
  • Acknowledgments can automatically expire. The Expire Acknowledgment after ... option provides for this.

Acknowledging several problems simultaneously

It's not that unusual to have a number of (related) problems needing to be acknowledged at the same time. This can be handled almost as easily. Call up a status view which shows all of these problems. Sometimes Quicksearch is enough for this; the Services ➳ Service Search view is somewhat more flexible.

Once you have got a view of the exact services to be acknowledged, simply proceed as described above. The command will be automatically applied for each of the services shown.

However, if you need a specific selection, you can click on the corresponding icon to show a checkbox for each line. Check the boxes of the required hosts or services, and then execute the command.

Attention: Never forget that commands are always performed on ALL displayed objects if you have NOT activated ANY checkboxes!
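
The rule behind this warning fits in a single line of logic. The following Python sketch is only a model of the behaviour described above; the function name and the example services are assumptions.

```python
def command_targets(displayed: list[str], checked: list[str]) -> list[str]:
    """Objects a command applies to: the checked rows, or ALL rows if none is checked."""
    return checked if checked else displayed


# Hypothetical example data
displayed = ["myhost123 / FOO Service", "myhost123 / BAR Service", "myhost456 / FOO Service"]

print(command_targets(displayed, ["myhost123 / FOO Service"]))  # only the checked row
print(command_targets(displayed, []))                           # every displayed row
```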

4.3. Downtimes

Sometimes things have not been broken accidentally, but on purpose. Or as we prefer to say, the problem is expected. For example, every piece of hardware or software must be serviced occasionally, and while the necessary maintenance work is being performed the affected host or service in the monitoring will of course go to WARN or CRIT.

For those who need to respond to problems in Checkmk, it is naturally very important to know about planned downtime, so that valuable time is not wasted on ‘false alarms’. To ensure this, Checkmk uses the concept of maintenance times, which are called Scheduled Downtimes. In many places you will see the shortened form Downtimes. Strictly speaking, a downtime only means that a host is DOWN or a service is CRIT; a scheduled downtime means that this happens deliberately.

So, if maintenance is required on an object, you can put it into maintenance – either immediately or for a selected period in the future. This works in the same way as an acknowledgment, except that you use the Downtimes field:

There are a whole bunch of options for maintenance. A comment must be entered in every case. By selecting the appropriate button you can start and end a maintenance time. For example, with the 2 hours button the object is declared as ‘in maintenance’ for two hours starting from the current time. Unlike acknowledgments, maintenance times always have an end time that is set in advance.

Here are some hints:

  • When you put a host into maintenance, all of its services are automatically considered to be in maintenance. You therefore save yourself the work of doing it multiple times.
  • If you use the Checkmk Enterprise Editions, you can also define regular maintenance times (for example, due to a mandatory reboot once a week).
  • The flexible downtimes start automatically only when the object actually assumes a non-OK state.

Here are the effects of a maintenance time:

  • The views will display an icon for the affected hosts/services.
  • Alerting of problems is disabled during maintenance.
  • Affected hosts/services no longer appear as problems in the Tactical Overview.
  • Scheduled maintenance times are considered separately in the availability analysis.
  • At the beginning and at the end of a maintenance period, a special alert is triggered to inform you.

Further information about maintenance times can be found, as always, in their own article.

5. Fine tuning Monitoring

5.1. False Alarms – the death of every monitoring system

Monitoring is only really useful if it is precise. The biggest obstacle to acceptance among colleagues (and probably also yourself) is false positives, or simply false alarms.

With some Checkmk newcomers, we have found that they include many systems in their monitoring within a short time frame – maybe because this is so easy in Checkmk. If the alert functions for all elements are then activated shortly after implementation, operations staff are flooded with hundreds of emails each day, and after just a few days their enthusiasm for monitoring is permanently destroyed.

Even though Checkmk really makes an effort to have sensible defaults for everything, it simply cannot know precisely enough how to deal with the normal conditions in your IT environment. Therefore a bit of manual effort on your part is required to fine-tune your monitoring and to get rid of the last few false positives. Apart from this, CMK will identify a lot of real problems that you and your colleagues have not noticed. These must first be dealt with – by resolving the problems, not by adjusting the monitoring!

The following principle has therefore proved successful: first quality, then quantity. Or, to put it differently:

  • Do not include too many hosts in the monitoring system at once.
  • Make sure that all services that do not really have a problem are flagged reliably as OK.
  • Activate the notifications via e-mail or SMS only if Checkmk runs reliably for a while without any, or with very few, false alarms.

In this chapter we will show you what fine-tuning options you have available (so that everything turns green), and how to get a grip on the occasional misfires.

5.2. Rules-based Configuration

Before we go to the configuration, we briefly have to address the subject of settings for hosts and services in Checkmk. Because CMK has been designed for large and complex environments, its operation is based on rules. This concept is very powerful and brings many benefits even in smaller environments.

The basic idea is that you do not need to set every parameter for each service explicitly, but rather state something like: ‘On all Oracle production servers, when file systems prefixed /var/ora are at 90% fill-level flag WARN, and at 95% flag CRIT.’

Such a rule can in one fell swoop establish thresholds for thousands of file systems. At the same time it also very clearly documents which monitoring policies apply in your business.

Of course, you can also specify individual cases separately. A suitable rule might look like this: ‘On the server srvora123 the file system /var/ora/db01 at 96% fill receives WARN, and at 98% receives CRIT.’ This example can be called an Exception – but it is nevertheless a completely normal rule.

Each rule has the same structure. It always consists of one condition, and one value. In addition you can also include a title and a comment to document the function of the rule.

The rules are organized in rule chains. There is a separate rule chain for every type of parameter in Checkmk. For example there is one named Filesystems (used space and growth) which sets the thresholds for all services that monitor file systems. If Checkmk wants to determine which thresholds a particular file system check receives, it goes through all of the rules in this chain in turn. The first rule that satisfies the condition sets the value – so in this case the exact requirements for when the file system check flags a WARN or CRIT.
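
The ‘first match wins’ principle can be modelled in a few lines of Python. This is a deliberately simplified sketch and not Checkmk's rule engine: the Rule class, its fields, the tag names and the host names are all assumptions, and each rule carries just one pair of thresholds. The fallback of 80%/90% corresponds to the file system defaults mentioned later in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    value: tuple[float, float]                       # (WARN %, CRIT %)
    host_tags: set[str] = field(default_factory=set)
    hosts: set[str] = field(default_factory=set)     # empty means any host
    mount_prefix: str = ""                           # matched from the left

    def matches(self, host: str, tags: set[str], mount: str) -> bool:
        return (
            self.host_tags <= tags
            and (not self.hosts or host in self.hosts)
            and mount.startswith(self.mount_prefix)
        )


def levels_for(chain, host, tags, mount, default=(80.0, 90.0)):
    """Walk the chain in order; the first matching rule sets the value."""
    for rule in chain:
        if rule.matches(host, tags, mount):
            return rule.value
    return default


chain = [
    # The exception comes first so that it wins over the general rule.
    Rule((96.0, 98.0), hosts={"srvora123"}, mount_prefix="/var/ora/db01"),
    Rule((90.0, 95.0), host_tags={"oracle", "prod"}, mount_prefix="/var/ora"),
]

print(levels_for(chain, "srvora123", {"oracle", "prod"}, "/var/ora/db01"))
print(levels_for(chain, "srvora200", {"oracle", "prod"}, "/var/ora/db02"))
print(levels_for(chain, "web01", {"prod"}, "/"))
```

The order of the rules matters in this model: if the general rule stood first, the exception for srvora123 would never be reached.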

5.3. Configuring Rules

How does that look in practice? The normal method is via the Host & Service Parameters WATO module, which provides you with all known rule chains:

The easiest way to get started here is with the search field. For example, type tablespace here so you can find all rule chains that have this text in the name or in the (not visible here) description:

The number with each name (here all 0) shows the number of rules in the respective chain. If you click on the name of the rule chain, you get the detailed view:

The rule chain shown here does not yet contain any rules. But with the Create rule in folder button you can create a rule. In doing so you already define the first part of the rule's condition, namely the WATO folder in which it should apply. If you change the Main directory setting, e.g. to Windows, the new rule applies only to hosts directly in or below the Windows folder.

Creating a rule (and of course editing it later) brings you to a form with three boxes: general properties, value and condition. In the Rule properties box all information is optional. In addition to the informative texts, you can also temporarily disable a rule here. This is handy because it saves you from deleting a rule and creating it again later when you only want to switch it off for a while.

Of course, what you find in the Value of a rule is completely up to you. As you can see here in the example, there can be quite a number of parameters. A typical case is as shown here: Each single parameter is activated by a checkbox, and the rule then alters only this parameter. You can allow a parameter to be set by another rule if that simplifies your configuration. In the example, only the thresholds for the percentage of free space in the tablespace are defined:

The field with the conditions looks a bit confusing:

The Condition type allows you to use predefined conditions that are managed via the Predef. Conditions button. This is a feature for ‘Power users’ who use a lot of rules which always have the same conditions. Let's just leave that on Explicit conditions for now.

You have already defined the Folder when you created it, but you can alter it again here.

The Host tags (host properties) are a very important feature of Checkmk: With this you can simply say that a rule should only apply for production systems. Because the host tags are so important, we'll dedicate a separate section to them right after this. To add a tag condition, first select a Tag Group in the selection list, followed by a click on Add tag condition.

Explicit hosts allows you to limit a rule to a few specific hosts.

Very important are the Explicit Tablespaces which restrict a rule to very specific services. Two points are important to note for this:

  • The name of this condition depends on the rule type. If it is called Explicit Services, specify the names of the affected services, e.g. Tablespace DW20, including the word tablespace. In the example shown, on the other hand, you only specify the name of the tablespace itself, e.g. DW20.
  • The texts are always matched starting at the left! The example rule thus also applies to the fictitious tablespace DW20A. If you do not want this, put a $ at the end – e.g. DW20$. These are so-called regular expressions.
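
You can try out this matching behaviour with Python's re module, whose re.match function also matches only at the beginning of a text. This is merely a way to experiment with the patterns; the helper function is an assumption and not part of Checkmk.

```python
import re


def rule_applies(pattern: str, name: str) -> bool:
    """Prefix match: the pattern must match at the start of the name."""
    return re.match(pattern, name) is not None


print(rule_applies("DW20", "DW20"))    # True
print(rule_applies("DW20", "DW20A"))   # True, because DW20A starts with DW20
print(rule_applies("DW20$", "DW20A"))  # False, the $ anchors the end of the text
print(rule_applies("DW20$", "DW20"))   # True
```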

The labels, which you can also see in the screenshot, are treated in their own chapter in the manual.

After saving, exactly one rule will be found in the rule chain:

5.4. Host Tags

How Host Tags function

Above we have seen an example of a rule that should apply only for ‘production’ systems. More specifically, we usually have a condition that identifies a production system through a Host Tag. Why do you do that instead of simply using folders? Well, you can only define a single folder structure, and each host can only be in a single folder. But there are many very different features that a host may have, and the folders are simply not flexible enough.

Tags, on the other hand, can be assigned to the hosts completely freely and arbitrarily – no matter in which folder the hosts are located. Rules can then later refer to these tags. This makes the configuration not only easier, but also easier to understand and less prone to error than if everything were explicitly set for every host.

But how and where to determine which hosts should have which tags? And how can you define your own tags?

Defining Tags and Tag Groups

Let's start with the second question: your own tags. First you have to know that tags are organized in groups, the Tag Groups. Let us take Location as an example. A Tag Group could thus be called Location, and this group could contain the tags Munich, Austin and Singapore. In principle, every host has exactly one tag from each group, so as soon as you define your own tag group, every host without exception has one of the tags from that group. Hosts for which you have not selected a tag from the group are simply assigned the first tag by default.
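
The ‘exactly one tag per group, first tag as default’ behaviour can be sketched as follows. This Python example is only an illustration: the data structure, the IDs and the function name are assumptions and do not show how Checkmk stores tag groups.

```python
TAG_GROUPS = {
    # group ID -> tag IDs in the order defined; the first one is the default
    "location": ["munich", "austin", "singapore"],
}


def host_tags(explicit: dict[str, str]) -> dict[str, str]:
    """Give the host exactly one tag per group, falling back to the first tag."""
    tags = {}
    for group, choices in TAG_GROUPS.items():
        chosen = explicit.get(group, choices[0])
        if chosen not in choices:
            raise ValueError(f"{chosen!r} is not a tag of group {group!r}")
        tags[group] = chosen
    return tags


print(host_tags({}))                      # no selection made: first tag applies
print(host_tags({"location": "austin"}))  # explicit selection
```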

The definitions of the tag groups can be found in the WATO ➳ Tags module.

As you can see, some tag groups are already predefined. Most of these you cannot change. We also recommend that the two predefined examples Criticality and Networking Segment are left alone. It is preferable to define your own groups – which is very easy.

Click New tag group, which brings you as expected to a form with multiple fields. In the first field you assign an internal ID, as so often in Checkmk – which serves as the key and which cannot be changed later – and a meaningful Title which you can customize later. The Topic only serves in the overview. If you assign a topic here, it will be displayed in a separate field in the host properties.

The actual tags are entered in the second field – the selection choices for the group. Again you assign an internal ID and a title to each tag:

Tips:

  • The IDs must be unique across all groups.
  • Groups with only one selection are allowed and are even useful. These will appear as checkboxes. Each host then either has the feature or not.
  • It is best to ignore the Auxiliary Tags.

Once you have saved, you can use the new tag group.

Assigning Tags to Hosts

You have already seen how tags are assigned to a host: in the Host Properties when creating or editing a host. In the Custom attributes field (or in your own, if you have assigned a topic) the new tag group will appear and there you can make a selection for the host:

As always, you can also set the tag to the folder and overwrite it on individual hosts as needed.

5.5. Finding Rule Chains more easily

There are many rule chains, and when searching it is not always easy to find the right one. But there is another way: If you have a certain service and want to modify its check parameters, open its menu and select the Parameters for this service entry:

This takes you to a page where you have access to all of the rule chains for this service:

In the first field titled Check origin and parameters, the second entry (here CPU utilization on Linux/UNIX) takes you directly to the rule chain that sets the thresholds for this service.

5.6. Thresholds for file systems

Now that you have learned the basic principle of configuring services, in the rest of the chapter we will show you some important things that you should configure in a new Checkmk system in order to reduce false alarms.

The first are custom thresholds for monitoring file systems. By default, Checkmk sets the thresholds for used disk space to 80% for WARN and 90% for CRIT. On a 2 TB drive, however, 80% used means that 400 GB are still available – maybe that is a bit too much buffer. So here are a few tips:

  • Create your own rules in the Filesystems (used space and growth) chain.
  • The parameters allow thresholds that depend on the size of the file system. Select Levels for filesystems ➳ Levels for filesystem used space ➳ Dynamic levels. With the Add new element button you can now define your own threshold values appropriate to each drive's capacity.
  • It is even easier with the Magic Factor, which we will introduce in the Best Practices chapter.
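
The arithmetic behind these tips is easy to check. The following Python sketch computes how much space is still free when a threshold is reached, and models size-dependent levels in a simplified way. The size classes and percentages in DYNAMIC_LEVELS are invented example values, not Checkmk defaults, and the function names are assumptions.

```python
def free_at_threshold(size_gb: float, used_percent: float) -> float:
    """Free space in GB that remains when the used-space threshold is reached."""
    return size_gb * (100.0 - used_percent) / 100.0


# Hypothetical size-dependent levels: (minimum size in GB, WARN %, CRIT %)
DYNAMIC_LEVELS = [
    (0, 80.0, 90.0),
    (500, 90.0, 95.0),
    (2000, 95.0, 98.0),
]


def levels_for_size(size_gb: float) -> tuple[float, float]:
    """Pick the levels of the largest size class the file system falls into."""
    warn, crit = DYNAMIC_LEVELS[0][1:]
    for min_size, w, c in DYNAMIC_LEVELS:
        if size_gb >= min_size:
            warn, crit = w, c
    return warn, crit


# Default WARN level on a 2 TB (2000 GB) drive: 400 GB are still free.
print(free_at_threshold(2000, 80.0))

# With the example size classes the same drive warns at 95% used.
warn, crit = levels_for_size(2000)
print(warn, crit, free_at_threshold(2000, warn))
```

With size-dependent levels like these, a large drive raises a warning only when comparatively little space is left, while small file systems keep the stricter percentages.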

5.7. Hosts which are allowed to go DOWN

It is not always a problem when a device is turned off. A classic case is printers. Monitoring printers with Checkmk makes sense – some users even manage the reordering of toner via Checkmk. However, switching off a printer before closing time is not a problem, in fact it is rather a good thing. It would be pointless for Checkmk to raise an alert when the corresponding host goes DOWN.

You can tell Checkmk that it is fine if a host is turned off. In WATO ➳ Host & Service Parameters, search for the Host check command rule set. Create a rule there for all printers (depending on how you have organized them, for example via a folder or via a matching host tag), and set its value to Always assume host to be up:

Now all printers are always displayed as UP – no matter what their real status is.

The printers' services will still be checked, though, and would get a timeout and thus a CRIT. To avoid this, configure a rule in the Access to Agents ➳ Check_MK Agent ➳ Status of the Check_MK services ruleset, in which you set timeouts and connection problems to OK:

5.8. Switchports

If you monitor a switch with Checkmk, you will notice that in the service configuration a service will be created automatically for each port that is UP at the time. This is a sensible default setting for core and distribution switches – i.e., where only infrastructure devices or servers are connected. For switches connected to devices such as workstations or printers, this leads to constant alarms when a port goes DOWN, and conversely to new services constantly being found because a previously unmonitored port is now UP.

Two approaches have proven themselves in practice here. The first is to restrict monitoring to the uplink ports. Do this by creating a rule in the Disabled services rule set that excludes the other ports from the monitoring.

Much more interesting, however, is the second method. With it you monitor all ports, but allow the DOWN state as a valid state. The advantage: for ports where printers or workstations are attached you also have monitoring of transmission errors, and so can very quickly recognize bad patch leads or errors in auto-negotiation.

To use this second method you need two rules. The first rule is in the Parameters for discovered services ➳ Discovery – automatic service detection ➳ Network Interface and Switch Port Discovery chain. This rule determines the conditions under which switch ports should be monitored. Create a rule for the required switches, and under Network interface port states to discover activate 2 - down alongside 1 - up:

In the service configuration of the switches, the ports with the status DOWN are now also available, and you can add these to the service list. Now before you activate everything you of course need the second rule which ensures that this condition is considered OK. The rule chain is called Network interfaces and switch ports. Activate the Operational state option, uncheck Ignore the operational state, and in Allowed states check the states 1 - up and 2 - down (and possibly other states if needed).
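
The effect of the second rule can be pictured like this. The Python sketch below is a simplified model and not the actual check: the numeric states 1 and 2 are the ones named in the rule, whereas the function name and the choice of CRIT for a state that is not allowed are assumptions made for this example.

```python
OK, CRIT = 0, 2

# Operational states as named in the rule: 1 - up, 2 - down
UP, DOWN = 1, 2


def port_state(oper_status: int, allowed: frozenset = frozenset({UP, DOWN})) -> int:
    """Return OK if the port's operational state is allowed, else CRIT."""
    return OK if oper_status in allowed else CRIT


print(port_state(DOWN))                     # down is allowed here, so OK
print(port_state(DOWN, frozenset({UP})))    # only up allowed, so a problem
```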

5.9. Hosts that are rebooted regularly

Some servers are restarted at regular intervals – whether to patch, or simply because it is intended. You can avoid false alarms at these times in two ways:

In the Checkmk Raw Edition you first define a Timeperiod covering the times of the reboot. You can find out how to do that in the article on timeperiods. Then place a rule in each of the Notification period for hosts and Notification period for services rule chains for the affected hosts, and there select the previously defined time period. The second rule is necessary so that services which go to CRIT within this time period trigger no alarm. If problems occur during these times and disappear again before the period ends, no alarm will be triggered afterwards either.

In the Checkmk Enterprise Editions there are maintenance times that repeat automatically at regular intervals, and you can easily specify these for the affected hosts.

Tip: As well as the method using commands that we showed under maintenance times, there is also a way through the Recurring downtimes for hosts rule set. This has the big advantage that hosts which are only added to the monitoring at a later date automatically get these maintenance times.

5.10. Permanently ignore services

For some services which are simply not reliably OK, it is in the end better not to monitor them at all. In such cases you could just remove the services from the monitoring manually in WATO, by putting them back on Undecided for the affected hosts or simply leaving them there. This is, however, awkward and prone to errors.

It is much better to define rules according to which certain services should systematically NOT be monitored. The Disabled services rule set exists for this purpose. In it you can, for example, very easily create a rule so that file systems with the mount point /var/test are not monitored.

Tip: If you deactivate a single service in a host's service configuration by clicking on the corresponding icon, a rule that applies only to this host will be created automatically in this rule chain. You can edit this rule by hand and, for example, remove the explicit hostname. The affected service will then be disabled for all hosts.

For more information about configuring services, see the dedicated article in the reference section.

5.11. Averages

One cause of sporadic alerts is thresholds on workload metrics, such as CPU utilization, which are only exceeded for a short time. As a rule such brief spikes are not a problem and thus should not be raised as alarms by the monitoring system.

For this reason a whole range of check plug-ins include the option of averaging the measured values over a longer time frame before applying the thresholds. An example of this is the rule chain for CPU usage for non-Unix systems named CPU utilization for simple devices. Here is the Averaging for total CPU utilization option:

If you activate this and enter 15, the CPU utilization will first be averaged over a 15-minute period, after which the thresholds will be applied to this averaged value.
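The following sketch shows why averaging removes short spikes. It uses a plain moving average over the last 15 one-minute samples, which is an assumption chosen for clarity; Checkmk's actual averaging may weight the samples differently. The function name and the sample values are hypothetical.

```python
from collections import deque

def averaged_alerts(samples, window, threshold):
    """Yield (raw, average, alert) per sample; the alert uses the moving average."""
    recent = deque(maxlen=window)   # keeps only the last `window` samples
    for value in samples:
        recent.append(value)
        average = sum(recent) / len(recent)
        yield value, average, average >= threshold

# One utilization value per minute, with a two-minute spike to 100%.
samples = [20] * 10 + [100, 100] + [20] * 3
results = list(averaged_alerts(samples, window=15, threshold=90))

print(sum(1 for raw, _, _ in results if raw >= 90))   # 2 raw samples exceed 90%
print(sum(1 for _, _, alert in results if alert))     # 0 alerts after averaging
```

The two raw samples at 100% would each have crossed a 90% threshold, but the averaged value never gets close to it.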

5.12. Getting a grip on sporadic errors

If nothing else helps, and some services still occasionally go into a problem state for a single check (even if only for a minute), there is one last method that prevents false alarms. The rule chain for this situation is Maximum number of attempts to verify the service.

Create a rule there and set the value to, e.g., 3. When a service then goes from OK to WARN, at first no alarm will be triggered, and no problem is displayed in the Tactical overview at this time. Only when the state has not been OK for three consecutive checks (a total elapsed time of just over two minutes) will the problem be considered ‘hard’ and reported.
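The behaviour can be sketched as a counter of consecutive non-OK results. This is a simplified illustration, not Checkmk's implementation, and the function name `first_hard_problem` is hypothetical.

```python
def first_hard_problem(states, max_attempts=3):
    """Return the index of the check at which a problem becomes 'hard', or None."""
    consecutive = 0
    for index, state in enumerate(states):
        # Any OK result resets the counter; every other state increases it.
        consecutive = consecutive + 1 if state != "OK" else 0
        if consecutive >= max_attempts:
            return index
    return None

# A single bad check in between never becomes a hard problem ...
print(first_hard_problem(["OK", "CRIT", "OK", "OK"]))                      # None
# ... but three bad checks in a row do, at the third one.
print(first_hard_problem(["OK", "WARN", "OK", "WARN", "WARN", "WARN"]))    # 5
```

With one check per minute, the third consecutive bad result arrives two check intervals after the first, which matches the "just over two minutes" mentioned above.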

Admittedly, that is not an ideal solution, and you should always try to solve the problem at the root, but sometimes things are just as they are, and with the Check attempts you at least have a viable workaround in such cases.

5.13. New and discontinued services

A data center is constantly changing, and thus the list of monitored services will never stay constant. So that you do not miss anything, Checkmk automatically creates a special service on each host. This is Check_MK Discovery.

By default, every two hours this checks whether new (not yet monitored) services are found or existing ones have been dropped. If this is the case, the service will go to WARN. You can then open the service configuration in WATO and bring it back up to date.

Tip: Some users save a bookmark for a view that contains all of the discovery services on all hosts which are not in the OK state. These you can then work through regularly – e.g. once a day.

6. Working with multiple users

6.1. Users in Checkmk

Once your monitoring is in a state where it can be useful to others, it is time to familiarize yourself with user management in Checkmk. If you operate the system entirely by yourself, working with cmkadmin is quite sufficient, and you can simply continue reading at the next chapter, which covers alerting.

But let's say you have colleagues who should use Checkmk together with you. Why not simply all work as cmkadmin? Well, theoretically that works, but it does create a number of difficulties. If you create one account per person, however, you gain several advantages:

  • Individual users can create their own bookmarks, customize their sidebar, and customize other things for themselves.
  • Different users may have different permissions.
  • Users can be responsible only for certain hosts and services, and only need to see these in their monitoring display.
  • You can delete one user's account when they leave or change jobs, without affecting anyone else's account name or password.

As always you will find all of the details about users, rights and roles in its own article.

6.2. Permissions and Roles

These last two points need special explanation. Let's start with permissions – the question of which users are permitted to perform which actions. For this purpose Checkmk uses the usual concept of roles. A role is nothing more than a collection of permissions. Each of the permissions allows a very specific action. For example, there is a Permission to be able to change global settings.

Checkmk is supplied with three basic roles as standard. These are:

Role | Abbreviation | Function
Administrator | admin | A user with this role is allowed to do everything. Its main task is the general configuration of Checkmk, not the day-to-day operation of it. This of course includes creating users and customizing roles.
Normal monitoring user | user | This role is intended for ‘normal’ users in operations. They may only see the hosts and services for which they are responsible. There is also the possibility of giving the role the right to manage its own hosts in WATO itself.
Guest user | guest | A guest user is allowed to see everything, but not change anything. This role is useful, e.g., if you want to hang a status monitor on a wall to display an overview of the monitoring. Because a guest user cannot change anything, it is also possible for multiple colleagues to use that account at the same time.

How to customize roles is explained in the detailed user management article.

6.3. Contacts and Responsibilities

The second important aspect of users is defining Responsibilities. Who is in charge of the host mysrv024, or is responsible for the service tablespace FOO on the host ora012? Who should see this in the status interface, and possibly be alerted if there is a problem?

This is performed in Checkmk not via roles, but via Contact Groups. The word ‘contact’ is meant in the sense of an alert: Who should the monitoring system contact when there is a problem?

The basic principle is as follows:

  • Each user can be a member of any number of contact groups, including none.
  • Each host and service is a member of at least one contact group.

Here is an example of such an association:

As you can see, both a person and a host (or service) can be a member of several groups. Membership in the groups has the following effects:

  • A user with the user role sees precisely the objects in the monitoring system which are in one of their contact groups.
  • If there is a problem with a host or service, then by default all users who are in at least one of its contact groups are alerted.

Important: There is no option in Checkmk to assign a host or service directly to a user. This is deliberate because it leads to problems in practice – for example when a colleague leaves your company.

6.4. Creating Contact Groups

Creating new contact groups is very easy, and is performed in the Contact groups WATO module. A contact group with the name Everything is already predefined. This is assigned automatically to all hosts and services. The purpose of this is for a simple system setup in which there is initially no division of tasks among the administrators (or you in the case where you take on everything yourself).

Use New contact group to create a new group. Here, as always, you need an ID that is used internally as a key, as well as a title that you can change later. Here in the example you will see a contact group with the ID servers, and the title Windows & Linux Servers:

6.5. Assigning hosts

After you have created the contact groups, you must on the one hand assign hosts and services, and of course on the other hand assign users. The latter is what you do in the properties for the users themselves, which we'll see right after this.

There are two ways to assign hosts to contact groups – you can also choose both methods at the same time:

  1. Assignment using rules with the Assignment of Hosts to Contact Groups rule set
  2. Assignment via the properties of the hosts or folders in WATO

Assignment using rules

The rule set that you need for the first method is most easily found with the Rules button in the Contact groups module. But as always the search function via Host & service parameters also helps if you just search for contactgroups:

By the way, even with a fresh Checkmk installation the rule set is not empty. You will find a rule here that assigns all hosts to the above-mentioned group Everything. So create new rules here yourself, and choose the group you want to assign to the hosts selected by the rule:

Important: If multiple rules apply to a host, all of the rules will be evaluated, and in this way the host will then receive several contact groups.

Assignment via WATO properties

The second method for assigning is to use the properties of a host in WATO. The procedure is as follows:

  1. Invoke the host properties in WATO.
  2. In the Basic settings box check the Permissions checkbox.
  3. Select one or more groups in the box Available, and move them to the right with the arrow buttons, into the Selected field.
  4. Enable Add these contact groups to the hosts.

The checkbox Always add host contact groups also to its services is not usually required, because services automatically inherit their host's contact groups. You will learn more about this later.

Of course, as always, you can also define this host property in the folder. The process is similar, except that this time there are a few extra checkboxes that you can simply leave in their default state.

6.6. Assigning services

You only have to assign services to contact groups if these groups differ from those of their host. However, there is an important principle: If a service has been explicitly assigned to at least one contact group, it will inherit no contact groups from the host.

This allows you to have a separation of server operations teams and applications teams, for example. If, for example, you plug the host srvwin123 into the windows contact group, but all services with the prefix Oracle are in the oracle contact group, the windows admins will not see the Oracle services, and conversely, the Oracle admins receive no details of the operating system's services – often a very useful separation.
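The inheritance principle and its effect on visibility can be sketched in a few lines. This is a simplified model for illustration, not Checkmk code; the function names are hypothetical, and the group names are the ones from the example above.

```python
def service_contact_groups(host_groups, explicit_service_groups):
    """Explicitly assigned groups replace the groups inherited from the host."""
    return set(explicit_service_groups) or set(host_groups)

def visible_to(user_groups, object_groups):
    """A user with the 'user' role sees an object if they share a contact group."""
    return bool(set(user_groups) & set(object_groups))

host_groups = {"windows"}                                         # host srvwin123
oracle_service = service_contact_groups(host_groups, {"oracle"})  # explicit rule
os_service = service_contact_groups(host_groups, set())           # inherits from host

print(visible_to({"windows"}, oracle_service))  # False
print(visible_to({"windows"}, os_service))      # True
print(visible_to({"oracle"}, oracle_service))   # True
print(visible_to({"oracle"}, os_service))       # False
```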

If you do not need this separation, then simply create assignments for the hosts – and you're done!

If you nevertheless need an explicit assignment, this is done via the Assignment of services to contact groups rule set. The procedure is analogous to that described above, but as usual you give conditions for the service name.

6.7. Creating users

The administration of users can be found in the WATO Users module:

Do not be surprised if next to the cmkadmin entry there is also an automation user! This user is for requests from processes and scripts that are intended for the HTTP-API, and which are provided by the Checkmk system itself. For details see the reference.

If you have discovered the LDAP Connections button: should your company use Active Directory or another LDAP service, you also have the option of including users and groups from these services. This is described in its own article.

Create a new user with the New user button. This form is of course almost identical to the one you see when you edit an existing user (the icon next to the user), except that it is not possible to change an existing user's username:

As always, enter an ID and a title in the first field – here the title is the user's full name. The Email address and pager address fields are optional and are used for alerting via email or SMS.

Note: Please do not enter an email address here yet. First read the notes in the chapter on alerting.

The second field concerns security and permissions:

Leave the setting on Normal user login with password and assign an initial password here. At the bottom you can assign roles to the user. If you assign more than one role, the user simply receives the maximum permissions from these roles (although for the three predefined roles this is not very useful).

In the third field you select the contact groups to which the user should belong. If you select the predefined Everything group, the user becomes responsible for everything, since this group contains every host and service:

By the way: The Personal Settings field contains precisely the settings (except for the password) which users can change themselves. Users with the guest role cannot change their own settings, so here you can set, e.g., the language or the User interface theme for them.

7. Notifications

7.1. The Basics

In Checkmk Notification means that users are actively notified when the state of a host or service changes. Let's say, at some point, on the host mywebsrv17 the service HTTP foo.bar changes from OK to CRIT. Checkmk recognizes this and, for example, sends an email with the most important data for this event to all contact persons for this service. Later the service again changes its state from CRIT to OK, so the contacts will receive a new email for the event – this time called Recovery.

But this is just the simplest way of alerting, and there are many possibilities for refining it:

  • You can alert via SMS, pager, Slack or other Internet services.
  • You can set alerts to certain time windows (standby).
  • You can define escalations if the responsible contact does not react quickly enough.
  • Users can autonomously ‘subscribe’ to or unsubscribe from notifications if you want to allow them.
  • You can generally use complex rules to specify who should be alerted about what, and when.

However, before you start using notifications, you should be aware of the following:

  • Notifying is an optional feature. Some organisations have a control desk that is staffed around the clock and which works only with the status view.
  • Initially enable notifications only for yourself, and make yourself responsible for everything. For a few days or weeks observe how big the volume of alarms is. Tune your monitoring.
  • Do not enable alerts for your colleagues until you have minimized false positives (false alarms).

7.2. Preparing email dispatching

The simplest and by far the most common procedure is alerting by email. This is easy to set up, and an email has enough ‘space’ to include graphs of the measured data.

Before you can alert by email, your Checkmk server needs to be set up for sending mail. For all supported Linux distributions this comes down to the following steps:

  1. Install an SMTP server service. This usually takes place automatically when installing the distribution.
  2. Specify a smarthost. Again, you are usually asked for this when installing the distribution. The smarthost is a mail server in your company that handles the delivery of emails for Checkmk. Very small companies usually do not have their own smarthost. In such cases you use the SMTP server provided by your email provider.

If the mail delivery is set up correctly, you should be able to send an email using the command line – with this command for example:

OMD[mysite]:~$ echo 'Testcontent' | mail -s Test harri.hirsch@example.com

The email should be delivered without delay. If this does not work, you will find information in the /var/log directory in the SMTP server's log file. More details on setting up mail services on Linux can be found in the reference section of the manual.

7.3. Activating notifications via e-mail

If the sending of email works in principle, then the activation of notifications is very easy – you may already have done it without realising it when creating the users. For a user to receive notifications the following two steps are necessary:

  • An email address must be entered in the user's properties.
  • The user must be responsible for hosts or services (via the appropriate contact groups).

7.4. Testing notifications

It would be a bit cumbersome to test notifications by waiting for a real problem to occur or even by provoking one. Testing is easier using the Fake check results command. This is found in the same way as the acknowledgements or the maintenance times.

Important: This box is only visible if you have the admin role.

It is best to choose a service that is currently OK and set it manually to CRIT. This should immediately trigger an alert. After one minute at the latest – when the next regular check is executed – the service then reverts by itself to OK, and a second alarm should be triggered – the Recovery.

7.5. Suppressing notifications

If you do not receive an email, it does not necessarily indicate an error, because there are many situations in which Checkmk notifications are deliberately suppressed:

  • If a host is DOWN, no alerts will be triggered for its services.
  • If notifications have been turned off in the Master Control snap-in.
  • If a service or host is in a maintenance time.
  • If a service has recently been changing between different states too often, and has thus been marked as flapping. This can happen quickly if you constantly change the state using Fake check results!
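The four situations in this list can be read as a simple checklist. The sketch below is an illustration only: the class, its field names and the order of the checks are assumptions made for this example and do not reflect how Checkmk evaluates them internally.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Situation:
    host_down: bool = False              # the service's host is DOWN
    notifications_enabled: bool = True   # switch in the Master Control snap-in
    in_maintenance: bool = False         # host or service is in a maintenance time
    flapping: bool = False               # state has changed too often recently

def suppression_reason(s: Situation) -> Optional[str]:
    """Return why a service notification is suppressed, or None if it is sent."""
    if not s.notifications_enabled:
        return "notifications are turned off in Master Control"
    if s.host_down:
        return "the host is DOWN"
    if s.in_maintenance:
        return "maintenance time"
    if s.flapping:
        return "service is flapping"
    return None

print(suppression_reason(Situation()))               # None
print(suppression_reason(Situation(flapping=True)))  # service is flapping
```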

7.6. Customizing the notifications

You can customize notifications in Checkmk in many different ways, and define very complex rules for who should be notified, when, and how. All details can be found in the reference section of the manual.

7.7. Troubleshooting

The notification module in Checkmk is very complex – simply because it covers many very different requirements that have proven to be important in over 10 years of field experience. The question “why has Checkmk not notified here” is thus asked more often by beginners than you may have suspected. This is why you will find some troubleshooting tips here.

If a notification from a particular service has not been triggered, the first step is to look at the History of the service. You will find this if you go to the service's detail page in the status interface, and click on History. There you will find all events for this service listed chronologically from the newest to the oldest. Here is an example of a service that was trying to trigger an alert, but mail delivery did not work (because no SMTP server is installed):

For more information see the var/log/notify.log file. You can, for example, read it in a terminal with the less command, or follow it continuously with the tail -f command. The latter is useful if you are only interested in new messages – i.e. those which were created after entering the tail command. Do not forget to first switch to your instance user with su - and the name of your instance:

root@linux# su - mysite
OMD[mysite]:~$ 

You can now open the file with less:

OMD[mysite]:~$ less var/log/notify.log

If you are not yet familiar with less, press Shift-G to jump to the bottom of the file (this is always useful in log files), and exit less with Q.

Here is a snippet from notify.log for a successfully-triggered alert:

var/log/notify.log
2019-09-05 10:21:48 Got raw notification (server-linux-3;CPU load) context with 71 variables
2019-09-05 10:21:48 Global rule 'Notify all contacts of a host/service via HTML email'...
2019-09-05 10:21:48  -> matches!
2019-09-05 10:21:48    - adding notification of martin via mail
2019-09-05 10:21:48 Executing 1 notifications:
2019-09-05 10:21:48   * notifying martin via mail, parameters: (no parameters), bulk: no
2019-09-05 10:21:48 Creating spoolfile: /omd/sites/mysite/var/check_mk/notify/spool/cbe1592e-a951-4b70-9bac-0141d3d74986
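If the log is long, a few lines of Python can pull out just the notifications that were actually executed. The sketch below assumes that the line format is exactly the one shown in the snippet above; other Checkmk versions may log differently, and the function name `sent_notifications` is hypothetical.

```python
import re

# Matches lines such as:
# "2019-09-05 10:21:48   * notifying martin via mail, parameters: ..."
NOTIFY_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\* notifying (\S+) via (\w+)"
)

def sent_notifications(lines):
    """Return (timestamp, contact, method) for each 'notifying ... via ...' line."""
    hits = []
    for line in lines:
        match = NOTIFY_LINE.match(line)
        if match:
            hits.append(match.groups())
    return hits

sample = [
    "2019-09-05 10:21:48 Got raw notification (server-linux-3;CPU load) context with 71 variables",
    "2019-09-05 10:21:48    - adding notification of martin via mail",
    "2019-09-05 10:21:48   * notifying martin via mail, parameters: (no parameters), bulk: no",
]
print(sent_notifications(sample))
# [('2019-09-05 10:21:48', 'martin', 'mail')]
```

To run this against the real file you would pass an open file object instead of the sample list, for example `sent_notifications(open("var/log/notify.log"))` as the instance user.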

If you want to go deeper into the subject of notifications, you will find all the relevant details in the reference part of the manual.

8. Extending the monitoring system further

With the setting up of notifications you have completed the last step, and your Checkmk system is ready! The possibilities within Checkmk are of course not yet exhausted at this point. There are many more ways to continue the expansion of your monitoring.

8.1. Optimizing security

Even if monitoring is ‘only watching’, the subject of IT security is also important. In the reference section you will find a security overview article which will give you tips on how to optimise your system's security.

8.2. Monitoring very large environments

If your monitoring has reached an order of magnitude where you are monitoring thousands of hosts, or even more, architecture and tuning issues become interesting. The most important topic here is distributed monitoring. With this you work with multiple Checkmk instances that interconnect into a large system – which may even be distributed globally.

8.3. Availability and SLAs

With the availability module, Checkmk can very precisely calculate the availability of hosts or services in specific time periods, how many failures occurred and how long they lasted, and much more.

With the SLA module included in the Checkmk Enterprise Editions, Checkmk can verify compliance with service level agreements, and even actively monitor these.

8.4. Hardware and software inventory

The hardware/software inventory does not really belong to the topic of monitoring, but using the already installed agents Checkmk can provide extensive information on the hardware and software of your monitored systems. This is very helpful for maintenance, license management, or the automatic loading of data into Configuration Management Databases.

8.5. Monitoring messages and events

So far we have only been monitoring the current states of hosts or services. A completely different topic is the evaluation of spontaneous messages which, e.g. appear in log files, or are sent by syslog or SNMP traps. Checkmk has a complete, integrated system called the Event Console.

8.6. Visualization using maps and diagrams

With the NagVis add-on integrated in Checkmk you can represent any states with maps or diagrams. This is great for creating appealing overviews – for screens in control rooms for example.

8.7. Business Intelligence

With the Business Intelligence module you can derive and clearly present the overall state of business-critical applications, based on the many individual status values provided by Checkmk.

8.8. Generating PDF reports

The reporting module included in the Checkmk Enterprise Editions enables the creation of PDF reports for clearly displaying information on past periods, events, availabilities and much more.

8.9. Automatic agent updates

If you monitor many Linux and Windows servers, the agent-updater contained in the Checkmk Enterprise Editions lets you keep your monitoring agents and their configurations at the desired level from a central location.

8.10. Developing your own plug-ins

Even though Checkmk delivers almost two thousand check plug-ins, it can always be the case that a specific plug-in is missing. How to develop such a plug-in yourself can be found in its own section in the manual.

9. Best Practices, Tips & Tricks

9.1. CPU single-core utilization

Checkmk automatically sets up a service on both Linux and Windows which monitors the average CPU usage over the last minute. This of course makes sense, but it fails to recognize a number of problems – for example, when a single process runs amok and permanently loads one CPU core at 100%. For a system with 16 CPU cores a single core contributes only 6.25% to the overall performance, and so in extreme cases like this one a load of only 6.25% is measured – which of course does not lead to an alert.

Checkmk therefore offers the possibility (for both Windows and Linux) to monitor all existing CPU cores individually and determine if any is permanently busy for a long time. Setting up this check has turned out to be a good idea.
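The arithmetic behind this is easy to reproduce. The following lines are a plain illustration with made-up per-core values, not output from a Checkmk agent:

```python
def total_utilization(per_core):
    """Average utilization over all cores, in percent."""
    return sum(per_core) / len(per_core)

# 16 cores: one is pinned at 100%, the other 15 are idle.
cores = [100.0] + [0.0] * 15

print(total_utilization(cores))   # 6.25  -> far below any sensible threshold
print(max(cores))                 # 100.0 -> only visible when cores are checked individually
```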

To set this up for your Windows servers, add a rule to the CPU utilization for simple devices chain. This chain is actually responsible for the monitoring of all CPUs, but it also has an option called Levels over extended periods on a single core CPU utilization. In this rule, activate only this option:

Define the rule condition so that it only applies to the Windows server, e.g. through a suitable folder or host tag. This rule does not affect other rules in the same chain if they set other options, e.g. the thresholds for total utilization.

The additional check then appears in the existing CPU utilization service.

For Linux servers this function is configured in the CPU utilization on Linux/UNIX rule chain, where you will find exactly the same option.

9.2. Monitoring Windows services

Checkmk does not by default monitor services on your Windows servers! Why not? Well, because it is not automatically clear which services are important to you.

If you do not want to bother to manually specify which services are important for each server, you can also set up a check that simply checks if all services with automatic startup are actually running. In addition, you can be informed when manually started services are running. This could become a problem, since these services will of course not be running automatically after a reboot.

To do this you'll first need a rule in the Windows Services chain, which you can always find with the search function. The crucial option in this rule is Service states. Activate this and add three elements:

This gives you the following definitions:

  • A service with startup type auto that is running is considered OK.
  • A service with startup type auto that is not running is considered CRIT.
  • A service with startup type Demand that is running is considered WARN.
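These three definitions behave like a small lookup table. The sketch below is illustrative only: the lower-case names, the "first matching entry wins" order and the OK default for combinations not listed (for example a stopped Demand service) are assumptions made for the example, not documented Checkmk behaviour.

```python
# (startup type, service state, resulting monitoring state); None matches any state.
SERVICE_STATE_RULES = [
    ("auto", "running", "OK"),
    ("auto", None, "CRIT"),
    ("demand", "running", "WARN"),
]

def evaluate_service(startup, state, default="OK"):
    """Return the monitoring state of the first matching entry."""
    for rule_startup, rule_state, result in SERVICE_STATE_RULES:
        if startup == rule_startup and rule_state in (None, state):
            return result
    return default   # assumption: combinations not listed stay OK

print(evaluate_service("auto", "running"))    # OK
print(evaluate_service("auto", "stopped"))    # CRIT
print(evaluate_service("demand", "running"))  # WARN
```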

However, this rule only applies to services that really become monitored! That is why we now need a second step: Create a new rule in the Windows Service Discovery chain. This controls which Windows services Checkmk automatically suggests as monitored services.

When you create this rule, first in the Services (Regular Expressions) field you can enter the regular expression .* that matches all services. If you save, and then in WATO switch to the service configuration for a suitable host, you will find a large number of new services – one for each Windows service.

To limit the number of monitored services, return to the rule and refine the search terms as needed. This is case-sensitive! Here is an example:

If you have already included the services in the monitoring configuration, they will now appear as missing. With the Automatic refresh (tabula rasa) button, you can clear the table and regenerate the whole list.

9.3. Monitoring the Internet connection

Of course, your company's access to the Internet is very important to everyone. The supervision of this is somewhat unusual, since there is not ‘the Internet’, but rather billions of hosts. However, you can still set up monitoring very efficiently according to the following blueprint:

  1. Select multiple Internet ping destinations that should normally be reachable and record their IP addresses.
  2. In WATO create one host called Internet.
  3. Enter one of the IP addresses for this host as an IPv4 address.
  4. Enter the other addresses for the same host under the Network address ➳ Additional IPv4 addresses option.
  5. Also set Data sources ➳ Check_MK Agent to No agent.
  6. Create a rule under Active checks (HTTP, TCP, etc.) ➳ Check hosts with PING (ICMP Echo Request) which only applies to this host.
  7. In this rule activate Service description, and enter Internet connection in the service name field.
  8. Also enable Alternative address to ping, and select Ping all IPv4 addresses.
  9. Activate Number of positive responses required for OK state and enter 1.
  10. Create another rule – this time under Monitoring Configuration ➳ Host check command – which also applies only to the host Internet.
  11. In the Host check command field, select Use the status of a service ..., and enter the service name Internet connection which you defined in step 7.

If you now activate the changes, you will receive a new host with the name Internet with only the Internet connection service. If at least one of the ping destinations is reachable the host will have the status UP and the service will have the status OK. Simultaneously, from the service you will get the data for the typical round trip time from each of the ping targets, as well as the packet loss, and thus also get an indication of the quality of your connection over time:

Steps 10 and 11 are necessary so that the host does not get the state DOWN if the first IP address cannot be reached by a ping. Instead the host always takes the status of its only service.
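The logic of this setup can be summarized in a short sketch. It is a simplified model, not Checkmk code: the function names are hypothetical, and treating "too few responses" as CRIT is an assumption, since the text above only describes the OK case.

```python
def internet_service_state(ping_ok, required=1):
    """OK if at least `required` ping destinations answered (step 9)."""
    return "OK" if sum(ping_ok) >= required else "CRIT"

def host_state(service_state):
    """Steps 10 and 11: the host simply follows the state of its only service."""
    return "UP" if service_state == "OK" else "DOWN"

# The first address does not answer, but another one does.
service = internet_service_state([False, True, False])
print(service, host_state(service))   # OK UP

# No destination answers at all.
service = internet_service_state([False, False, False])
print(service, host_state(service))   # CRIT DOWN
```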

Important: Because a service is generally not alerted when its host is DOWN, it is important that you make the notification relate to the host – not the service. In addition you should use a notification method that does not require an Internet connection!

9.4. Monitoring HTTP/HTTPS services

Let's say you want to check the accessibility of a website or web service. The normal Checkmk agent does not provide a solution because it does not display this information – and you may also not have the possibility of installing the agent on the server.

The solution for this is a so-called active check. This is a check that is not performed by an agent, but by contacting the destination host directly over a network protocol – in this case HTTP(S). The procedure is as follows:

  1. Create the destination server as a host in WATO. Let's give it the name tribe29.com.
  2. In Data sources ➳ Check_MK Agent, select No agent and save it without service detection.
  3. Now create a rule in the Active Checks (HTTP, TCP, etc.) ➳ Check HTTP service rule set for this host (e.g. with Explicit hosts or an appropriate host tag).
  4. In the Check HTTP service box you will find many options for how to perform the check. More on this later.
  5. Save the rule and activate the changes. Now you will get a new host with a service that checks access via HTTP(S).

The options for this rule include the following:

  • In Virtual host you may be required to specify a domain of the server if it hosts more than one domain.
  • The Use SSL/HTTPS for the connection option allows monitoring of HTTPS.
  • Expected response time allows you to set the service to WARN or even CRIT if the response time is too slow.
  • The Fixed string to expect in the content option allows you to check the answer for a specific text in the delivered page. You should always check for a relevant part of the content, so that a simple error message from the server is not already considered a positive response.
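How the response time and the expected string combine into a state can be sketched as follows. This is not the logic of Checkmk's HTTP check but a simplified illustration; the function name, the default thresholds in seconds and the sample page contents are all assumptions.

```python
def evaluate_http(status, body, seconds, expect, warn=1.0, crit=5.0):
    """Combine HTTP status, expected content and response time into a state."""
    if status >= 400 or expect not in body:
        return "CRIT"          # error status, or the expected text is missing
    if seconds >= crit:
        return "CRIT"
    if seconds >= warn:
        return "WARN"
    return "OK"

expected = "Welcome to the shop"
print(evaluate_http(200, "<h1>Welcome to the shop</h1>", 0.2, expected))  # OK
print(evaluate_http(200, "<h1>Database error</h1>", 0.2, expected))       # CRIT
print(evaluate_http(200, "<h1>Welcome to the shop</h1>", 2.0, expected))  # WARN
```

The second call shows why the expected string matters: the server answers quickly and with status 200, but the page is only an error message.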

By the way, you can of course also perform the HTTP check on a host that is already being monitored by a Checkmk agent. In this case creating the host is omitted and you just need the correct rule.

9.5. Intelligent file system thresholds

Finding good thresholds for monitoring file systems can be a bit tedious and require a lot of rules. A threshold of 90% is much too low for a very large drive, and it may be too high for a small drive. In addition to the method mentioned in the chapter about tuning, there is another more practical way to define thresholds depending on the size of the drive: the Magic factor. It works like this:

  1. In the Filesystems (used space and growth) rule set, you apply only one rule, with thresholds of 80% and 90% respectively.
  2. In the same rule enable Magic factor (automatic level adaptation for large filesystem), and enter 0.8.
  3. Also enable Reference size for magic factor and enter 100 GB as the size.

If you now activate the rule, you will get thresholds that automatically depend on the size of the file system:

  1. File systems that are exactly 100 GB receive the thresholds 80%/90%.
  2. File systems that are larger than 100 GB get higher thresholds which are closer to 100%.
  3. File systems that are smaller than 100 GB get lower thresholds – i.e. ones below 80%/90%.

How high the thresholds exactly are is – well, magical! The factor (here 0.8) determines how strongly the values can be adjusted. A factor of 1.0 does not change anything, and all drives get the same values. Smaller values bend the thresholds more. Which exact thresholds apply can easily be seen in each service's status text:

The following table shows some examples of the resulting thresholds for a reference size of 100 GB:

Drive capacity | mf = 1.0 | mf = 0.9 | mf = 0.8 | mf = 0.7 | mf = 0.6 | mf = 0.5
800 GB | 80% | 84% | 87% | 89% | 91% | 93%
300 GB | 80% | 82% | 84% | 86% | 87% | 88%
100 GB | 80% | 80% | 80% | 80% | 80% | 80%
50 GB | 80% | 82% | 83% | 85% | 86% | 87%
5 GB | 80% | 73% | 64% | 51% | 50% | 50%
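This guide does not give the scaling formula itself. The sketch below uses one formula, stated here as an assumption, that reproduces the 800 GB, 300 GB, 100 GB and 5 GB rows of the table for the 80% threshold: the free share of the threshold is multiplied by (size / reference size) raised to the power (magic factor - 1). The function name and the lower bounds `min_warn` and `min_crit` are also assumptions; the 50% floor is inferred from the 5 GB row.

```python
def magic_levels(size_gb, warn=80.0, crit=90.0, magic=0.8,
                 ref_gb=100.0, min_warn=50.0, min_crit=60.0):
    """Return (warn, crit) in percent, adapted to the file system size."""
    relative = size_gb / ref_gb
    scale = relative ** magic / relative          # same as relative ** (magic - 1)
    warn_scaled = 100.0 - (100.0 - warn) * scale
    crit_scaled = 100.0 - (100.0 - crit) * scale
    # Assumed lower bounds, so that tiny drives do not get absurdly low levels.
    return max(warn_scaled, min_warn), max(crit_scaled, min_crit)

for size in (800, 300, 100, 5):
    warn, crit = magic_levels(size)               # magic factor 0.8
    print(f"{size} GB: warn at {round(warn)}%")
# 800 GB: warn at 87%
# 300 GB: warn at 84%
# 100 GB: warn at 80%
# 5 GB: warn at 64%
```

A factor of 1.0 gives a scale of exactly 1 for every size, which is why the mf = 1.0 column is 80% throughout.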