1. Basic configuration
1.1. Initializing at the first start
At this point you should have set up and started the appliance, either in a rack or as a virtual machine under VirtualBox or VMware ESXi. During the first start you will be asked to select the desired language. This selection applies to the whole device.
You will then be shown a message asking you to initialise the data medium. Confirm this dialogue box and wait for the device startup to resume and for the status screen to be displayed.
Once the device has been started up, you will see the status screen on the local console.
1.2. Network and access configuration via the console
From the status screen you can get to the configuration menu by pressing the F1 key.
To put the device into operation you now need to set up the network configuration and specify the device password.
Use this dialogue box to set up the network configuration of the device. You will be asked for the IP address, the netmask and the optional default gateway.
In most cases the device will need to also access network devices outside of its own network segment. The default gateway must also be configured for this purpose.
Once these values have been entered the configuration will be activated, meaning the device will immediately be reachable via the network at the entered IP address. One way of testing this is to send a ping from another device in the network.
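Before entering the values it can help to check that they fit together. The following Python sketch is not part of the appliance; the function name check_network_config and the gateway address 192.168.178.1 are assumptions chosen purely for illustration. It checks that the default gateway lies in the same network segment as the device address, which is a precondition for the gateway being reachable at all.

```python
import ipaddress


def check_network_config(ip, netmask, gateway=None):
    """Return the network segment if the values are consistent, else raise ValueError.

    Hypothetical helper for illustration; the appliance performs its own checks.
    """
    # Combine address and netmask into one interface object,
    # e.g. 192.168.178.111/255.255.255.0.
    iface = ipaddress.ip_interface(f"{ip}/{netmask}")
    if gateway is not None:
        gw = ipaddress.ip_address(gateway)
        # A default gateway is only usable if it belongs to the same
        # network segment as the device itself.
        if gw not in iface.network:
            raise ValueError(f"gateway {gw} is outside {iface.network}")
    return iface.network


print(check_network_config("192.168.178.111", "255.255.255.0", "192.168.178.1"))
# 192.168.178.0/24
```

A gateway from a different segment, for example 10.0.0.1 with the values above, makes the function raise a ValueError.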
Access to the web interface
A large part of the device’s configuration is carried out via the web interface. Access to this web interface is protected by a password – the device password – which you need to specify first.
The factory settings do not include a device password, which means that you cannot yet access the web interface. Press the F1 key on the status screen, and select device password in order to set the password.
Then select the web interface option from the configuration menu to activate the web interface.
Once you have completed these steps you will see the configured IP address in the device information box, and web interface: on in the access box on the console’s status screen, as shown in the screenshot above. If you have already connected the device with your network correctly, you will now see that the network connection is active (Network: UP in the Status box).
Protecting access to the console
When you started the appliance, you may have noticed that there was no password prompt.
After the basic configuration, access to the local console is not automatically protected. This means that anyone who has direct access to the rack or to the management interface of the virtualization solution is able to change the basic configuration of the device. You should therefore activate password protection via the Console Login menu.
Once you have completed these steps you will see console login: on in the access box on the console’s status screen.
In order to protect such access, you have the option to activate a login that must be used in conjunction with the device password before settings can be made or the current status viewed. If you have already activated the web interface, you will have already set a device password and do not need to assign a new one. If you have not done so, open the configuration menu on the console by pressing the F1 key and select device password to set the password.
Then select console login from the configuration menu in order to activate this option.
1.3. Basic settings on the web interface
Once you have configured the access to the web interface you can now access the web interface via a web browser on a computer connected to the device via the network. To do this enter the appliance URL in the address bar of the browser – here for example http://192.168.178.111/. Here you can see the login screen of the web interface.
After you have logged in with the password previously specified for the web interface the main menu will open. From here you can access all of the features of the web interface.
In this menu select the item device settings. In this dialogue box you can see and change the most important device settings.
By clicking on the titles of the settings you will be taken to the dialogue box for adjusting the respective setting.
If you have DNS servers available in your environment you should first configure one or more of them so that host names can be resolved. If you have one or more NTP servers for time synchronisation available in your environment, enter these as IP addresses or host names under the NTP server item.
If emails are to be sent from your device – such as notifications in the event of problems being detected – you must configure the Outgoing Emails option. To do this enter the SMTP relay server responsible for this device and any access data required. All emails generated on the device will be sent to this server. Under this setting, you can also configure all emails generated by the device’s operating system (e.g. in the case of critical errors) to be sent to a particular email address.
The basic configuration of the device is now complete and you can continue with setting up the first monitoring instance.
2. Administrating monitoring instances
2.1. Creating a new site
Open the main menu of the web interface and click on the instance administration menu item. In this dialogue box you have access to all monitoring instances of this device. You can configure, update and delete monitoring instances, as well as create new ones.
The first time you open the dialogue box it will be empty. To create your first monitoring instance, click on the Create instance button. In the dialogue box that then appears you can specify the initial configuration of the monitoring instance.
Start by entering an instance ID which serves to identify the monitoring instance. The instance ID may only contain letters, numbers, - and _, must start with a letter or a _ and may be a maximum of 16 characters in length.
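The rules for instance IDs can be expressed as a single regular expression. The following Python sketch is an illustration only: the name is_valid_instance_id is hypothetical, and it assumes that "letters" means the ASCII letters a-z and A-Z, which is not stated explicitly above.

```python
import re

# Letters, digits, "-" and "_"; the first character must be a letter or "_";
# one leading character plus up to 15 more gives the maximum length of 16.
INSTANCE_ID_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_-]{0,15}")


def is_valid_instance_id(instance_id):
    """Check an instance ID against the rules stated above (illustrative only)."""
    return INSTANCE_ID_RE.fullmatch(instance_id) is not None


for candidate in ("mysite", "_test-1", "1site", "a_very_long_instance_id"):
    print(candidate, is_valid_instance_id(candidate))
```

The first two candidates are accepted, while 1site is rejected because it starts with a digit and a_very_long_instance_id is rejected because it is longer than 16 characters.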
Now select the Checkmk version with which to create the monitoring instance.
You can leave all other settings as they are for the time being. You can change these settings later on using the instance editing dialogue.
As soon as you confirm the dialogue box with Create instance the new monitoring instance will be created. This may take a few seconds. Once the instance has been created and started you will be taken to the list of all monitoring instances.
In this list you will see the instance just created with the ID mysite. You can also see the status of the instance, where running means the instance has been fully started. You can start or stop the instance with the button to the right of the status. On the left you will see various icons with which you can a) edit the settings of the instance, b) update the instance and c) delete the instance.
After the instance has been created and started you can either click on the instance ID or enter the URL for the monitoring instance (in this case 192.168.178.111/mysite) into the address bar of your browser.
You will now see the login screen of the monitoring instance where you can log in using the access data you entered when creating the instance. Once you have logged in you can set up Checkmk in the usual manner. The snap-in Checkmk Appliance is available in all monitoring instances and for all administrators. You will find it in the sidebar. This snap-in will take you from your monitoring instances directly to the web interface of the device.
2.2. Migrating existing sites
It is commonly required to migrate existing sites from other Linux systems to a Checkmk appliance. The Checkmk appliance offers a migration dialogue which performs the migration for you.
The following requirements need to be met:
- You need to have a network connection between the source system and your device.
- The Checkmk version of the source site needs to be installed on your device (architecture changes from 32-bit to 64-bit are possible).
- The source site needs to be stopped during the migration.
Open the main menu of the web interface and click on the Site Management menu item. Then click on the Migrate Site button.
In this dialogue you first need to enter the host address (host name, DNS name or IP address) of the source system from which you want to migrate the site. Next you need to enter the site ID of the site you want to migrate.
The migration of the site is done via SSH. To get access to the source site, you need to provide the credentials of a user which is able to connect to the source system and access all of the source site’s files. You can use the root user of the source system or, if you have configured a password for the site user, you can use the site user credentials.
Optionally you can choose to let the migration create the site with a new site ID on your device, or carry the original ID over to the new device unchanged.
Additionally you have the option to skip the carrying-over of performance data (measurements, graphs) and historic monitoring data during the migration. This can be useful if you don’t need an exact copy of the source site and only want to copy it – e.g. for testing purposes.
After filling in the dialogue and confirming it by clicking the start button, the following dialogue will show you the progress of the migration.
After completion of the migration you can finish the migration by clicking on Complete. You will be returned to the instance administration dialogue where you can start and manage the newly imported site in the usual way.
3. Administrating Checkmk versions
It is possible to install several Checkmk versions on the device at the same time. This allows several instances to be run in different versions, and for individual instances to be changed to newer or older versions independently of one another. This means that you can install a new version for example and try it out initially in a test instance in order to then update your production instance if the testing is successful.
To administrate the Checkmk versions select the Checkmk versions item in the main menu of the web interface.
The dialogue box which then appears will list all installed Checkmk versions. If a version is not being used by any instance and is not the last-installed version, you have the option to delete it from the appliance.
Using this dialogue box you can also upload new Checkmk versions onto the device in order to use them in new instances or to update existing instances.
To do this, download the desired Checkmk version from our website onto your computer (search for Checkmk appliance in the Distribution column). Then select the file from your hard drive using the file selection dialogue and confirm your selection by clicking on Upload & Install.
The Checkmk version will now be uploaded onto the device. Depending on the network connection between your computer and the device this may take a few minutes. Once uploading is complete you will see the new version in the table of installed versions.
4. Firmware installation
You can update the software of your device to a newer version or downgrade to an earlier version. Both are carried out via the so-called firmware update in the web interface.
First, download the desired firmware package from our website onto your computer. Then open the device’s web interface and select the Firmware update item in the main menu.
In the box that appears select the firmware package you downloaded before from the hard drive. Confirm the dialogue box by clicking on Upload & install. The package will now be loaded onto your device. Depending on the network connection, this may take a few minutes.
Once the package has been recognised as valid firmware you will be shown a dialogue box asking you to confirm the firmware update. Depending on the version differences between the current version and the one to be installed various messages will appear telling you what to do with your data during the update.
- Change to the first digit of the version number: You must back up the data of your device manually and restore it after the update. An update cannot be performed without data migration.
- Update to higher number for the second digit: The update can be carried out without data migration. You are advised to back up your data beforehand anyway.
- Downgrade to lower number for the second digit: You must back up the data of your device manually and restore it after the update. An update cannot be performed without data migration.
- Change in the third digit: The update can be carried out without data migration. You are advised to back up your data beforehand anyway.
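The four rules above can be summarised in a few lines of code. The following Python sketch is an illustration only and assumes that firmware versions consist of exactly three numeric parts separated by dots. The function names parse_version and needs_data_migration are hypothetical and not part of the appliance.

```python
def parse_version(version):
    """Split a three-part version string such as '1.4.2' into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def needs_data_migration(current, target):
    """Return True if the data must be backed up manually and restored afterwards."""
    cur_major, cur_minor, _ = parse_version(current)
    new_major, new_minor, _ = parse_version(target)
    if new_major != cur_major:
        # Change to the first digit: data migration is mandatory.
        return True
    if new_minor < cur_minor:
        # Downgrade of the second digit: data migration is mandatory.
        return True
    # Higher second digit, or only the third digit differs: no migration needed.
    return False


print(needs_data_migration("1.4.2", "1.5.0"))  # False
print(needs_data_migration("1.4.2", "1.3.9"))  # True
```

Even where the function returns False, the advice above still applies: back up your data before the update anyway.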
When you confirm this dialogue box the device will be restarted immediately. During this restart the most recently uploaded firmware will be installed. The restart will therefore take much longer than usual, although normally less than 10 minutes. A further restart will be carried out after the installation has taken place. This will complete the firmware update.
5. Device settings
5.1. Changing the language
During the basic configuration you specified the language for your device. You can change this at any time, either via the console configuration or via the device settings in the web interface. As with all other settings in this dialogue, changes take effect immediately when saved.
5.2. Changing the network configuration
During the basic configuration you specified the network configuration of your device. You can change this at any time, either via the console configuration or via the device settings in the web interface. If you made an error when specifying the network configuration and the device is no longer accessible via the network you can only correct the settings on the console.
5.3. Configuring host and domain names
Host and domain names serve to identify a computer in the network. When sending emails for example, these names are used to form the sender address. In addition, the configured host name is added as a source host to all log entries that are sent to a syslog server. This makes it easier to assign the entries.
5.4. Configuring name resolution
In most environments DNS servers are used to translate IP addresses into host names and vice versa. Host names or FQDNs (Fully Qualified Domain Names) are frequently used for monitoring instead of IP addresses.
In order to use the name resolution on your device, you must configure the IP addresses of at least one DNS server in your environment. It is recommended to enter at least two DNS servers.
5.5. Configuring time synchronisation
The system time of the device is used for many purposes, such as for recording measurement data or writing log files. A stable system time is therefore very important. This is best ensured by using a time synchronisation service (NTP).
5.6. Forwarding syslog entries
Log messages are generated on the device by the operating system and some permanently running processes. They are initially written into a local log via syslog.
You can also send these entries to a central or higher-level syslog server where they can be evaluated, filtered or archived.
Select the item Syslog to configure the forwarding.
In the dialogue box that appears next you can configure which protocol you wish to use for forwarding. Syslog via UDP is more widely used, but not as reliable as via TCP. So if your syslog server supports both protocols it is recommended to use TCP.
5.7. Changing the default web page
If you access the host address of the device directly via the web browser without entering a path, you will by default be taken to the device’s start page. However, it is also possible for you to be forwarded directly to a monitoring instance of your choice.
You can configure this using the setting HTTP access without URL. Via this setting, select the monitoring instance to open instead of the web interface. The Appliance home page can then be reached via the URL along with the path – for example 192.168.178.111/webconf.
5.8. Configuring outgoing emails
So that you can send emails from the device (in the case of events during monitoring for example), the forwarding of emails to one of your mail servers must be configured using Outgoing Emails.
In order for the sending of emails to work you must have at least configured the host address of your mail server as an SMTP relay server. This server will then receive the emails from your device and forward them.
However configuring the SMTP relay server is only sufficient as long as your mail server accepts emails via anonymous SMTP. If your mail server requires authentication, then you need to activate the appropriate login method under the Authentication item and indicate the access data of an account that can log onto the mail server.
If you still do not receive any emails after the configuration it is worth taking a look at the device’s system log. All attempts to send emails are logged there.
The device itself can send system emails if there are critical problems (e.g. a job cannot be executed or a hardware problem has been detected). In order to receive these emails you must configure an email address to which these emails are to be sent using Send local system mails to.
5.9. Changing access to Checkmk agents
A Checkmk agent is installed on the device and in the basic setting can only be queried by the device itself. You can use it to create an instance on the device and directly add the device to the monitoring.
It is also possible to make the Checkmk agent accessible from another device, meaning the device can also be monitored by another Checkmk system (e.g. in a distributed environment by a central server). For this purpose, you can configure a list of IP addresses that are allowed to contact the Checkmk agent.
6. Remote access via SSH
6.1. Access options
You can activate various access types for the SSH remote management protocol. Basically, the following are supported:
- access to the console and
- direct access to the sites
6.2. Activating instance login via SSH
You can activate access to the command line of the individual monitoring instances, enabling you to view and control the entire environment of the instance.
This access is controlled via the instance administration. In the settings dialogue of each individual instance you can activate and deactivate access as well as set a password to protect access.
6.3. Activating console via SSH
It is possible to activate access to the console of the device via the network, enabling you to view and adjust the basic configuration of the device even without direct access to the device.
You can enable access via the configuration dialogue of the console. To do this, select the menu item Activate console via SSH.
When you activate this option, you will be asked to enter a password. You must enter this password when you connect as the setup user via SSH. Access will be enabled automatically directly after you confirm this dialogue.
You can now connect to the device as the setup user using an SSH client (e.g. PuTTY).
6.4. Activating root access via SSH
It is possible to activate access to the device as the root system user. After the device has been initialised, however, this access is deactivated by default. Once it has been activated you can log onto the device as the root user via SSH.
Commands you execute on the device as root can cause lasting alteration or damage, not only to your data, but also to the delivered system. The manufacturer shall accept no liability for alterations you make in this way. Only activate and use the root user if you are sure what you are doing and only for diagnostic purposes.
You can enable access via the configuration dialogue of the console. To do this, select the menu item Root access via SSH.
Then set the option to enable.
When you activate this option you will be asked to enter a password. You must enter this password when you connect as the root user via SSH. Access will be enabled automatically directly after you confirm this dialogue.
You can now connect to the device as the root user using an SSH client (e.g. PuTTY).
7. Protecting the appliance-GUI via TLS
7.1. Setting up TLS access
By default the web interface of your device is accessed via HTTP in plain text. You can protect this access via HTTPS (TLS), so that data is transferred in encrypted form.
7.2. Installing a certificate
In order to encrypt data traffic the device next needs a certificate and a private key. There are several ways available for you to install these.
- Create a new certificate and have it signed by a certification authority by sending a certificate signing request (CSR).
- Upload an existing private key and certificate.
- Create a new certificate and sign it yourself.
You can choose one of the options above that fits your requirements and possibilities. Certificates signed by certification authorities generally have the advantage that clients can automatically verify the authenticity of the host (device) at the time of access. This is normally the case with official certification authorities.
If a user accesses the web interface via HTTPS and the certificate is either self-signed or signed by a certification authority not trusted by the user, this will cause a warning to appear in the user’s web browser first.
Creating a new certificate and having it signed
To create a new certificate, select the option New certificate. In the dialogue box that follows, you now enter device and operator information, which is then stored on the certificate and can be used by both the certification authority and clients later on to verify the certificate.
Once you have confirmed the dialogue box with Save, you can download the certificate signing request (CSR) file from the web access page. You must provide this file to your certification authority. You will then receive a signed certificate from your certification authority and, where necessary, a certificate chain (often consisting of intermediate and/or root certificates). You will usually receive these in the form of .pem or .crt files or directly in PEM-encoded text form.
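If your certification authority returns the certificate and the chain together as one PEM-encoded text, it can be useful to separate the individual certificates before uploading them. The following Python sketch shows one way to do this with the standard library only. The function name split_pem_bundle is hypothetical, and the sketch merely locates the BEGIN CERTIFICATE and END CERTIFICATE markers without validating the certificates.

```python
import re

# A PEM certificate is the text between these two marker lines (inclusive).
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem_bundle(text):
    """Return every PEM certificate block found in `text`, in order of appearance."""
    return PEM_CERT_RE.findall(text)
```

In many bundles the first block is the certificate issued for the device and the remaining blocks form the chain, but this ordering is a convention rather than a guarantee, so check it against the information from your certification authority.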
You can now transfer the signed certificate to the device via the Upload certificate dialogue. If you have received a certificate chain you can likewise upload it via this dialogue.
Once you have confirmed the dialogue with Upload you can continue configuring the types of access.
Creating a new certificate and signing it yourself
To create a new certificate select the option New certificate. In the dialogue box that follows you now enter device and operator information, which is then stored on the certificate, and which can later be used by clients to verify the certificate.
In the last section Signing method you now select Create a self-signed certificate. After that you can specify the maximum validity period of the certificate.
Once this validity period has expired you must generate a new certificate. This should be done in good time before the expiration so that there are no problems accessing your device.
Once you have confirmed the dialogue with Save you can continue configuring the types of access.
Uploading an existing certificate
If you have an existing certificate along with a private key and wish to use this to protect HTTPS traffic, you can transfer these files to your device via the Upload certificate dialogue.
7.3. Configuring access types
Once you have installed a certificate you can now configure the access types according to your requirements.
If you wish to protect access to your device via HTTPS, we recommend selecting the HTTPS enforced (incl. redirect from HTTP to HTTPS) option. The device will only respond via HTTPS, but will redirect all incoming HTTP requests to HTTPS. This means that users who inadvertently access the web interface via HTTP, either directly or via bookmarks, will automatically be redirected to HTTPS.
If it is very important that not a single request goes over the network in plain text, you can select the option HTTPS only. This setting will cause users accessing via HTTP to receive an error message.
You can also have a simultaneous configuration of HTTP and HTTPS. However this setting is only recommended in exceptional cases, for migration purposes or for testing.
7.4. Displaying current configuration/certificates
On the access type configuration page you can see the types of access currently active as well as information regarding the current certificate.
8. Device control
8.1. Restarting / Shutting down
You can restart or shut down the device over both the web interface and the console.
In the web interface you will find the menu items Reboot device and Shutdown device under the point Control device in the main menu. The device will execute the action immediately after the command has been selected.
In the console you can open the device control menu by pressing F2.
8.2. Restoring the factory configuration
You can reset your device to its factory settings. This means that any changes you have made to the device (e.g. your device settings, monitoring configuration or recorded statistics and logs) will be deleted. When resetting the settings the firmware version currently installed will be retained – the firmware installed with the device as delivered will not be restored.
You can perform this action on the console. To do this press the F2 key on the status screen and select Factory Reset in the dialogue box that follows. Confirm the next dialogue box with yes. Your data will now be deleted from the device and the device will then be restarted immediately. The device will now start with a fresh configuration.
In order to preserve your monitoring data in case of a hardware failure or similar destruction, a backup of your data can be configured via your appliance’s web user interface.
To be certain the data really is backed up it must be saved to another device, a file server for example. To do this, first use the mount management to configure the network share to be used for the backup. This share is then defined as the target when configuring the data backup. Once this is completed you can create a backup job that saves a backup of your system to the network share at predefined intervals.
The full backup includes all of the configurations defined on the system, installed files, and likewise your monitoring instances.
The backup is executed online, i.e. during active operation. However, this is only fully possible when all monitoring instances on the appliance use Checkmk 1.2.8p6, 1.4.0i1 or a daily build dated 22.07.2016 or newer. Active instances using older versions will be stopped before the backup and restarted after it.
9.2. Automatic backup
To set up an automatic data backup, configure one or more backup jobs. A backup data set must be created on the target system for each backup job. When each new backup is completed, the previous backup will be deleted – meaning that on the target system double the storage allocation will be temporarily required.
9.3. Configuring the backup
Using the file system management, first configure your network share. In our example a network share is mounted under the file path /mnt/auto/backup.
Next, select the Device backup item in the web interface’s main menu, and in the next menu open the Backup target. Then create a New backup target. The title and the ID can be chosen freely. Under the Target directory for backup item enter the path of the mounted network share, in this case /mnt/auto/backup. The Is mountpoint option must be active if you are backing up to a network share, as it allows the backup to verify that the share really is mounted.
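The idea behind the Is mountpoint option can be illustrated with a short Python sketch. It is not the appliance's actual implementation, and the function name target_is_usable is hypothetical. The sketch rejects a target directory that exists but has nothing mounted on it, because a backup written there would end up on the local disk instead of on the network share.

```python
import os


def target_is_usable(path, is_mountpoint=True):
    """Return True if `path` may be used as a backup target.

    Hypothetical helper that mimics the idea behind the 'Is mountpoint' option.
    """
    if not os.path.isdir(path):
        return False
    if is_mountpoint and not os.path.ismount(path):
        # The directory exists, but nothing is mounted there, so the
        # backup would silently be written to the local file system.
        return False
    return True


print(target_is_usable("/mnt/auto/backup"))
```

The result of the last line depends on the system on which the sketch is run.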
Once the backup target has been created, return to the Device backup page and from there select New job. Here again you can choose an ID and a title. Next, select the newly-created backup target and define the desired periods for running the backup.
After saving you will see an entry for your new backup job on the Appliance backup page. The scheduled time for the next execution is shown at the end of this line. As soon as the job has started or completed, its status will be shown in this view. Here you can also start backups manually or, if needed, interrupt running backups.
To test your newly created job, click on the Play-icon. You will see in the table that your job is currently running. By clicking on the Log-icon you can display the job’s progress in the form of a log output.
As soon as the backup has completed this will also be shown in the table.
9.4. Backup format
Every backup job creates a directory on the backup target. This directory’s name conforms to the following schema:
- Appliance backups: Checkmk_Appliance-[HOSTNAME]-[LOCAL_JOB_ID]-[STATE]
- Instance backups: Checkmk-[HOSTNAME]-[SITE]-[LOCAL_JOB_ID]-[STATE]
Within the placeholder fields, any - (minus) characters are replaced by + so that they cannot be confused with the field separators.
During the backup the directory will be saved with the suffix: -incomplete. Once completed the directory is renamed and the suffix changed to: -complete.
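The naming schema can be reproduced with a few lines of Python, for example in order to locate a particular backup on the target. This is a sketch under assumptions: the function name backup_dir_name, the host name cmk-app and the job ID daily are made up for illustration, and the state is taken to be either incomplete or complete as described above.

```python
def backup_dir_name(hostname, job_id, state, site=None):
    """Build the directory name of an appliance backup or, if `site` is given,
    of an instance backup, following the schema described above."""

    def escape(field):
        # "-" separates the fields, so inside a field it is replaced by "+".
        return field.replace("-", "+")

    if site is None:
        parts = ["Checkmk_Appliance", escape(hostname), escape(job_id)]
    else:
        parts = ["Checkmk", escape(hostname), escape(site), escape(job_id)]
    return "-".join(parts + [state])


print(backup_dir_name("cmk-app", "daily", "incomplete"))
# Checkmk_Appliance-cmk+app-daily-incomplete
print(backup_dir_name("cmk-app", "daily", "complete", site="mysite"))
# Checkmk-cmk+app-mysite-daily-complete
```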
A file named mkbackup.info containing the meta information pertaining to the backup is saved in the directory. Alongside this file a number of archives are saved to the directory.
The archive named system contains the appliance’s configuration, system-data contains the data file system’s data – excluding that of the monitoring instances. The monitoring instances are saved in separate archives that use the site-[SITENAME] naming schema.
Depending on the backup’s mode, these archives are saved with the .tar file extension for uncompressed and unencrypted, .tar.gz for compressed but unencrypted, and .tar.gz.enc for compressed and encrypted archives.
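When inspecting a backup directory by script, the archive name and the backup mode can be derived from the file name alone. The following Python sketch covers only the three file extensions listed above; the function name describe_archive is hypothetical.

```python
# Longest suffix first, so that ".tar.gz.enc" is not mistaken for a shorter one.
SUFFIXES = (
    (".tar.gz.enc", {"compressed": True, "encrypted": True}),
    (".tar.gz", {"compressed": True, "encrypted": False}),
    (".tar", {"compressed": False, "encrypted": False}),
)


def describe_archive(filename):
    """Return the archive name plus compression and encryption flags."""
    for suffix, flags in SUFFIXES:
        if filename.endswith(suffix):
            return {"name": filename[: -len(suffix)], **flags}
    raise ValueError(f"not a known backup archive: {filename}")


print(describe_archive("site-mysite.tar.gz.enc"))
```

Files that are not archives, such as mkbackup.info, cause a ValueError and can be skipped by the caller.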
If you want to encrypt your backup you can configure this directly from the web user interface. Your backed-up data will then be completely encrypted before being transferred to the backup target. The encryption uses a previously created encryption key. This key is protected by a password that you define when creating the key. Keep both the key and the password in a safe place, as the backed-up data can only be retrieved with both of them.
To this end, open the Device backup page and from there select the Backup keys page. Here you can create a new encryption key. When entering the password be sure to use a sufficiently complex character string – the longer and more complex your password, the harder it is for an attacker to decrypt your key and thus your backup.
Once you have created your key, download it and retain it in a secure location.
An encrypted backup can only be restored with the encryption key and its corresponding password.
Now, on the Device backup page, edit the backup job that is to create the encrypted backups. There, activate the Encryption item and select the newly created encryption key.
It is possible to compress the data during the copy procedure. This can be useful if you need to save bandwidth or if space on the target system is limited.
Please be aware, however, that compression requires noticeably more CPU time and that the backup procedure will therefore take longer. As a rule it is advisable not to activate compression.
Using the web user interface’s built-in functions you can only perform a complete restore. Restoring individual files via the web interface is not supported. This is nevertheless possible via the command line by manually unpacking them from the backup.
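Manually unpacking a single file can be sketched with Python's tarfile module. The sketch assumes an unencrypted archive (.tar or .tar.gz); encrypted archives would have to be decrypted first, which is not covered here. The function name extract_member is hypothetical, and the paths inside an archive are not described in this manual, so list the archive contents first to find the file you need.

```python
import tarfile


def extract_member(archive_path, member_name, dest_dir):
    """Extract a single file from an unencrypted backup archive into dest_dir.

    Works for .tar and .tar.gz archives; .tar.gz.enc archives are not handled.
    """
    # Mode "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(archive_path, "r:*") as archive:
        # getmember() raises KeyError if the archive has no such entry.
        member = archive.getmember(member_name)
        archive.extract(member, path=dest_dir)
```

The extracted file is written below dest_dir with the same relative path it has inside the archive. Only unpack archives from a backup target you trust.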
If you wish to restore a complete backup on a currently running appliance, select the Restore item on the Device backup page, and on the next page select the backup target from where you want to source the backed-up data. Once the backup target has been selected a list of all of its available backups will be shown.
Next, click on the arrow beside the backup you wish to use. After you have confirmed a security query, the restore will start.
While the restore is running you can view its progress by refreshing the Restore page that will be automatically displayed.
At the end of the restore the appliance will automatically restart – following this new start the restore will be complete.
If you need to completely restore an appliance, disaster recovery consists of the following steps:
- Start with an appliance in the factory default configuration (a new, identical appliance, or an appliance that has been reset to the factory defaults).
- Ensure that the firmware version matches that of the backup.
Configure the following minimum settings on the console:
- Network settings.
- Access to the web interface.
In the web interface, configure:
- The backup target from which you wish to restore.
- For an encrypted backup, upload the encryption key.
From Checkmk version 1.4.0i1, the service discovery on the appliance finds a new service for every configured backup job: Backup [JOB-ID]. This service reports potential problems with the backup and displays useful values such as size and duration.
9.9. Special features with clusters
The complete configuration of the backups, including the encryption keys will be synchronised between the cluster nodes. The cluster nodes run the backups separately, and likewise save separate directories for their backups on the backup target.
The active cluster node backs up the complete appliance including the data from the data file system and from the monitoring site. The inactive cluster node saves only its local appliance configuration.
10. Mounting network file systems
If, for example, you wish to make a backup to a network share, you must first configure the required network file system.
Currently supported are NFS (version 3), Windows shares (Samba or CIFS) and SSHFS (SFTP).
Mounting a network file system
In the web user interface’s main menu select the item Manage mounts and from here create a new file system. Enter an ID that will later be used in the device to identify the file system.
Next select if and how the file system is to be mounted. We recommend automatic mounting on access and automatic unmounting when inactive.
Next configure the type of share to be mounted, and finally, depending on this, the necessary settings for mounting the share - for example the file server’s network address and the exported file path in the case of NFS.
Once saved the newly-configured file system and its current status can be viewed in the file system management. By clicking on the plug icon you can manually mount the file system to test that the configuration is correct.
11. Failover cluster
You can combine two Checkmk appliances into a failover cluster. All configurations and data are synchronized between the two devices. The devices that are connected as a cluster are also called nodes. One of the nodes in the cluster assumes the active role, i.e. performs the tasks of the cluster. Both nodes continuously exchange information about their status. As soon as the inactive node recognizes that the active node can no longer fulfill its tasks (due to a failure, for example), the inactive node takes over the tasks and becomes the active node.
The failover cluster is there to increase the availability of your monitoring installation by protecting a device or individual components against hardware failures. The clustering is not a substitute for data backups.
The cluster ensures a shorter downtime in the following situations:
- There are two servers: one (active) server performs tasks, such as Monitoring, and the other (inactive) server simply checks that the first server is fulfilling its tasks.
- If the active server can no longer access the network, it cannot perform its tasks (for example, Monitoring).
- The inactive server notices this and takes over the tasks automatically.
- The active server becomes inactive and the inactive server is now active – thus swapping their roles.
- The server that is now active and performing the monitoring has also taken over the resources.
- If you carry out a firmware update you can update the nodes individually. While one node is being updated the other node will continue to perform the monitoring.
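The role swap described above can be pictured as a very small state model. The following sketch is a simplified illustration only. The class and function names are hypothetical, and the real cluster software exchanges status information continuously rather than through a single function call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    active: bool
    healthy: bool = True


def check_and_failover(node_a: Node, node_b: Node) -> Optional[Node]:
    """Return the node holding the resources after one status check.

    Assumes exactly one of the two nodes is active when called.
    """
    active, standby = (node_a, node_b) if node_a.active else (node_b, node_a)
    if active.healthy:
        return active  # nothing to do, the active node keeps its role
    active.active = False  # the failed node gives up the active role
    if standby.healthy:
        standby.active = True  # the standby node takes over the resources
        return standby
    return None  # no healthy node left to run the resources


node1 = Node("node1", active=True)
node2 = Node("node2", active=False)
node1.healthy = False  # e.g. node1 can no longer access the network
holder = check_and_failover(node1, node2)
print(holder.name)  # node2
```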
In order to build a cluster you will first need two compatible Checkmk appliances. The following models can be clustered with one another:
- 2x Checkmk rack1
- 2x Checkmk rack4
- 2x Checkmk rail2
- 2x Checkmk virt1
- 1x Checkmk rack1 and 1x Checkmk virt1
In addition, the two devices must use a compatible firmware, and at least version 1.1.0.
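The model and firmware prerequisites can be written down as a simple check. This is a sketch under stated assumptions: the model names are shortened to `rack1`, `rack4`, `rail2` and `virt1`, firmware versions are assumed to be purely numeric dotted strings, and only the minimum version 1.1.0 is checked, not whether two firmware versions are otherwise compatible with each other.

```python
# Pairs taken from the list above; the order within a pair does not matter.
COMPATIBLE_PAIRS = {
    frozenset({"rack1"}),
    frozenset({"rack4"}),
    frozenset({"rail2"}),
    frozenset({"virt1"}),
    frozenset({"rack1", "virt1"}),
}
MIN_FIRMWARE = (1, 1, 0)


def parse_version(version: str) -> tuple:
    """Turn a string such as '1.4.2' into a comparable tuple (1, 4, 2)."""
    return tuple(int(part) for part in version.split("."))


def can_cluster(model_a: str, firmware_a: str, model_b: str, firmware_b: str) -> bool:
    """Check the model pairing and the minimum firmware version of both devices."""
    models_ok = frozenset({model_a, model_b}) in COMPATIBLE_PAIRS
    firmware_ok = all(
        parse_version(v) >= MIN_FIRMWARE for v in (firmware_a, firmware_b)
    )
    return models_ok and firmware_ok


print(can_cluster("rack1", "1.4.2", "virt1", "1.4.2"))  # True
print(can_cluster("rack1", "1.4.2", "rack4", "1.4.2"))  # False
```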
The devices must be wired with at least two mutually-independent network connections. It is recommended to use as direct a connection as possible between the devices, and to make a further connection over your LAN.
To increase the availability of network connections, you should – instead of using two connections via individual network connectors – create a bonding configuration that uses all four of the Checkmk rack1’s network connectors. Use the LAN1 and LAN2 interfaces for the connection to your network, and the LAN3 and LAN4 interfaces for the direct connection between the devices.
Virtual machines: If you want to set up a cluster with two 'Checkmk virt1' appliances under VirtualBox, for example for testing, you should do without the bonding configuration and its total of four network interfaces, since bonding is unreliable under VirtualBox, if it works at all. Even if both VMs run on the same machine, so that there are no separate physical lines, you still need two virtual network interfaces in order to set up a separate channel for synchronising the data later. You can easily add these in the VirtualBox management interface of the virt1 machine.
So instead of setting up the bonding as shown below, simply activate the unused second network interface, using not your normal LAN subnet (e.g. 192.168.178.0/24) but a separate subnet (e.g. 192.168.100.0/24). For the actual clustering you then select your two individual interfaces instead of the bonding interfaces.
11.3. Migration of existing installations
Devices that were delivered and initialised with the firmware version 1.1.0 or higher can be clustered without migration.
Devices initialised with earlier firmware must first be updated to version 1.1.0 or higher. The device’s factory settings then need to be restored, preparing the device for clustering. Please note that, in order to prevent data loss during this procedure, you must first back up your data from the device and then restore it.
11.4. Configuration of the cluster
This guide assumes that you have already pre-configured both devices so that the web interface can be opened with a web browser.
Before actually setting up the cluster you must first prepare both devices. This mainly involves adapting the network configuration to fulfill clustering requirements (see prerequisites).
The configuration of a cluster with two Checkmk rack1 is shown in the following. A cluster is built which looks as shown in the diagram below.
The interface designations LAN1, LAN2 etc., used in the diagram correspond to the designations of the physical interfaces on the device. In the operating system, LAN1 corresponds to the device eth0, LAN2 to the device eth1 etc.
This configuration complies with the recommendations for the clustering of two Checkmk rack1s. You can of course use IP addresses suitable for your environment. Make sure, however, that the internal cluster network (bond1 in the diagram) uses a different IP network from the 'external' network (bond0 in the diagram).
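Whether two planned networks are really separate can be verified before you enter them. The following sketch uses Python's standard `ipaddress` module. The function name is hypothetical and the addresses are examples only.

```python
import ipaddress


def networks_are_distinct(external: str, internal: str) -> bool:
    """Return True if the two networks do not overlap."""
    # strict=False also accepts host addresses such as 192.168.178.100/24.
    ext_net = ipaddress.ip_network(external, strict=False)
    int_net = ipaddress.ip_network(internal, strict=False)
    return not ext_net.overlaps(int_net)


print(networks_are_distinct("192.168.178.0/24", "192.168.100.0/24"))    # True
print(networks_are_distinct("192.168.178.0/24", "192.168.178.128/25"))  # False
```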
Open the web interface of the first node, select Device settings and Network settings at the top. You will now be on the network settings configuration page. There are two modes available to you here. The Simple Mode which you can only use to configure your device’s LAN1 is activated by default.
The Advanced mode is required for clustering. In order to activate this mode click on the Advanced mode button at the top and confirm the security prompt.
On the following page you will see all of the network interfaces available in the device. At this point only the interface eth0 (corresponding to LAN1; shown as enp0s17 in the screenshot) will have a configuration, which was applied by the Simple Mode.
Now create the first bonding interface bond0 by clicking on Create Bonding. For this purpose enter into the dialogue that follows all data as shown in the diagram below, and confirm the dialogue with Save.
Now create the second bonding interface bond1 with the appropriate configuration.
After you have created the two bonding interfaces, you will be able to review all of the settings for the network interfaces, and likewise for the bondings, in the network configuration dialogue.
Once you have successfully completed all configuration steps, make the settings effective by clicking on Activate Changes. The new network settings will then be loaded. After a few seconds the network interface configuration will look like this, with the status OK for the interfaces:
And the bonding configuration will look like this:
Now, with the appropriate settings, repeat the configuration of network settings on your second device.
Devices to be connected in a cluster must have different host names. You can specify these now in the device settings. In our example, we configure node1 as a host name on the first device and node2 on the second device.
Connecting the cluster
Having completed preparations you can now continue setting up the cluster. To do this open the Clustering module in the main menu of the first device (here node1) in the web interface and click on Create Cluster.
Now enter the appropriate configuration in the cluster creation dialogue and confirm the dialogue with Save. If you require more information about this dialogue, click on the icon beside the Checkmk logo in the top right-hand corner. Context help will then appear in the dialogue explaining the individual options.
On the following page you can connect the two devices to form a cluster. To do this you need to enter the password of the web interface of the second device. This password is used once to establish the connection between the two devices. Then confirm the security prompt if you are sure that you want to overwrite the data of the target device with the IP address displayed.
Once this connection has been established successfully, the cluster setup begins. You can have the current status displayed on the cluster page. As soon as the cluster has been successfully built, the synchronisation of monitoring data will start from the first to the second node. While this synchronisation is still taking place, all resources, including any monitoring instances you may have, will be started on the first node.
From now on you can, using the cluster IP address (here 192.168.178.110), access the resources of the cluster (e.g., your monitoring instances), regardless of the node by which the resources are currently being held.
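A quick way to confirm that the cluster IP address answers is a TCP connection test from another machine in the network. The sketch below assumes that the web interface listens on the usual HTTP or HTTPS port (80 or 443). The ports and the function name are assumptions, and the address is the example used in this guide.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


# Example call (not executed here):
# is_reachable("192.168.178.110", 443)
```

The result does not depend on which node currently holds the resources, because the cluster IP address always points to the active node.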
11.5. The state of the cluster
When the first synchronisation is complete, your cluster will be fully operational. You can view the state at any time on the cluster page.
Using the status screen on the console you can also view the current state of the cluster in a summarised form in the Cluster box. The role of the respective node is shown after the current status with (M) for the master host and (S) for the slave host.
11.6. Special cases in the cluster
Access to resources
All requests to the monitoring instances (e.g. web interface access) as well as incoming messages (e.g. SNMP traps or syslog messages to the event console or requests to Livestatus) should normally always be sent via the cluster IP address.
Only in exceptional cases (e.g. diagnostics or updates of a particular node) should you need to access the individual nodes directly.
The settings (e.g. time synchronisation or name resolution settings) that have been made independently on the individual devices until now are synchronised between the two nodes in the cluster.
However, you can only execute these settings on the node that is active at the time. The settings are locked on the inactive node.
There are some device-specific settings (e.g. those of the management interface of the Checkmk rack1) which you can adapt on the individual devices at any time.
IP addresses or host names of the nodes
To be able to edit the IP configuration of the individual nodes, you must first disable the connection between the nodes. To do this click on Disconnect cluster on the cluster page. You can then adapt the desired settings via the web interface of the individual nodes.
Once you have made the adjustments you must now select Reconnect cluster on the cluster page. If the nodes can be successfully reconnected the cluster will resume operation after a few minutes. You can see the status on the cluster page.
Administering Checkmk versions and monitoring instances
The monitoring instances and Checkmk versions are also synchronised between the two nodes. You can only modify these in the web interface of the active node.
11.7. Administrative tasks
Firmware updates in the cluster
The firmware version of a device is not synchronised in cluster operation. The update is thus carried out for each node. You have the advantage however that one node can continue performing the monitoring while the other node is updated.
When updating to a compatible firmware version, you should always proceed as follows:
First open the Clustering module in the web interface of the node to be updated.
Now click on the heart symbol in the column of this node and confirm the security prompt that follows. This will put the node into maintenance state.
Nodes that are in maintenance state release all resources currently active on the node, upon which the other node takes control of them.
While a node is in maintenance state, the cluster is not failsafe – so if the active node is now switched off, the inactive node in maintenance state will not take control of the resources. If you now additionally put the second node into maintenance state, all resources will be shut down. These will only be reactivated when a node is taken out of its maintenance state. You must always remove the maintenance state again manually.
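The effect of the maintenance state on the resources can be summarised in a few lines of logic. This is a simplified sketch with hypothetical function names. It assumes that every node that is not in maintenance state is also healthy.

```python
from typing import Dict, Optional


def resource_holder(maintenance: Dict[str, bool], current: str) -> Optional[str]:
    """Return the node that holds the resources, or None if they are shut down.

    maintenance maps a node name to True if that node is in maintenance state.
    current is the node that held the resources before the change.
    """
    if not maintenance[current]:
        return current  # the current holder stays in charge
    for name, in_maintenance in maintenance.items():
        if not in_maintenance:
            return name  # the other node takes control of the resources
    return None  # all nodes in maintenance: resources are shut down


def is_failsafe(maintenance: Dict[str, bool]) -> bool:
    """The cluster is only failsafe while no node is in maintenance state."""
    return not any(maintenance.values())


state = {"node1": True, "node2": False}
print(resource_holder(state, "node1"))  # node2
print(is_failsafe(state))               # False
```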
You can see on the cluster page that the node is in maintenance state.
You can now perform the firmware update on this node in the same way as on standalone devices.
After you have successfully performed the firmware update, open the cluster page once more and remove the maintenance state of the updated device. The device will then automatically merge into cluster operation, upon which the cluster again becomes fully functional.
11.8. Disbanding clusters
It is possible to disband the nodes from a cluster and continue running them separately. When doing so you can continue using the synchronised configuration on both devices, or reset one of the devices to factory settings and reconfigure it for example.
You can remove one or both nodes from the cluster during operation. If you wish to use both nodes, you must ensure that the data synchronisation is in good working order beforehand. You can verify this on the cluster page.
In order to disband a cluster, click on Disband Cluster on the cluster page of the web interface. Read the text of the confirmation prompt that follows. Depending on the situation, this text describes the state the respective device will be in after the disconnection.
The disconnection of the devices must be carried out on both nodes separately, so that both devices can be run separately in future.
If you only wish to use one of the devices in future, disconnect the cluster on the device you intend to continue using and then restore the factory settings on the other device.
Once you have disconnected a node from the cluster the monitoring instances will not be started automatically. If you wish to start the monitoring instances, you will need to do so via the web interface.
Exchanging a device
If the hard drives of the old device are in good order, you can take these from the old device and insert them into the new device, wire the new device in exactly the same way as the old device was wired, and then switch it on. After starting, the new device will merge into the cluster in the same way as the old device.
If you want to completely replace an old device with a new one, you should proceed in the same way as when disbanding the cluster completely (see previous chapter). To do this select one of the previous devices, disconnect this device from the cluster and create a new cluster with this device and the new device.
11.9. Diagnostics and troubleshooting
Cluster administration is a largely automatic function: automatic processes on the nodes decide which resources are to be started and stopped on which device. This behaviour is logged in the form of detailed log entries. You can access these entries from the cluster page by pressing the Cluster Log button.
Please note that these entries – just like the other system messages – are lost when restarting the device. If you would like to keep the messages for longer you can download the current log file over your browser or set up a permanent forwarding of log messages to a syslog server.
12. SMS notifications
It is possible to attach a GSM modem to the device in order to have SMS notifications sent over it by Checkmk (in the event of critical problems for example).
At the moment it is not possible to order a UMTS/GSM modem together with your appliance or later as an accessory. However, there are several modems, such as the MTD-H5-2.0, which are compatible with the appliance.
12.2. Starting up the modem
In order to put the modem into operation you must insert a functioning SIM card, attach the modem to a free USB connector on your appliance using the enclosed USB cable, and connect the modem to the mains using the enclosed power adapter.
As soon as this has been done the device will automatically detect the modem and set it up. Open the device’s web interface and select the Manage SMS module. The current state of the modem as well as the connection with the mobile phone network will be displayed on this page.
If you need to enter a PIN to use your SIM card, you can specify this PIN under SMS settings.
If sent messages do not reach you, you can view all sent or non-sent messages and messages awaiting sending on the page Manage SMS. The entries in these lists will be kept for a maximum of 30 days and then automatically deleted.
It is possible, via the menu item Send test SMS, to send a test SMS to a number of your choice.
13. Administering RAID on the racks
13.1. The RAID system
Your rack has two hard drive bays on the front. These are marked with numbers 1 and 2. The hard disks installed here are interconnected in a RAID-1 array (mirror) so that your data is stored redundantly on both hard disks. If one of the hard disks fails the data is still available on the second hard disk.
13.2. Administration in the web interface
You can view the state of the RAID in your device’s web interface. To do this select the item RAID-Setup in the main menu of the web interface. This screen also gives you the option to repair the RAID if necessary.
13.3. Exchanging a defective hard drive
If a hard drive is detected as being defective, it will be shown with the status defective in the web interface. On the actual device, depending on the nature of the error, this will be indicated by a blue flashing LED at the hard drive bay.
Moving the small lever on the left-hand side of the bay will unlock the fixing mechanism, enabling you to pull the frame out of the housing together with the hard drive. You can now loosen the screws on the underside of the frame and remove the defective hard drive. Now mount the new hard drive into the frame and push the frame back into the free bay of the device.
If the device is switched on while you are exchanging the hard drive, the RAID rebuild will start automatically. You can view this procedure’s progress in the web interface.
Failsafe operation is only restored once the RAID has been completely repaired.
13.4. Both hard drives defective
14. Management interface in the rack
Your rack has a built-in management interface that allows network access to the device even when it is not powered on. You can use the web interface of this management interface, for example, to control the device if it is not switched on or no longer accessible, and to remotely control the local console.
If you would like to use the management interface, you must first connect the dedicated IPMI LAN connector with your network.
For security reasons we recommend connecting the IPMI LAN with a dedicated management network where possible.
The management interface is delivered deactivated. You can activate and configure it via the Management Interface setting in the device settings.
You must assign a separate IP address for the management interface and specify dedicated access data for the access to the management interface.
Despite careful tests, it cannot be altogether ruled out that unexpected errors may occur, which are difficult to diagnose without looking at the operating system.
One option is to have the log entries that are generated on the system sent via syslog to a syslog server. However the log entries of the individual monitoring instances are not processed via syslog, meaning they are not forwarded and can only be viewed on the device.
In order to make diagnostics on the device easier there is a view that displays the device’s various log files. You can go to this view by clicking on the Log Files menu item in the web interface’s main menu.
You can select the device’s logs and view their current content here.
The system log is reinitialized each time the device is started up. If you would like to keep the log entries you must send them to a syslog server.
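To check that forwarded log entries actually arrive, a throwaway receiver on the target machine can be useful before you set up a real syslog server. The following sketch listens for syslog messages over UDP. The transport (UDP), the port 514 and the function names are assumptions, and binding to port 514 normally requires administrator rights.

```python
import socket


def open_listener(host: str = "0.0.0.0", port: int = 514) -> socket.socket:
    """Open a UDP socket for incoming syslog messages."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_syslog(sock: socket.socket, count: int = 1, timeout: float = 5.0) -> list:
    """Read up to `count` messages; stop early if none arrives within timeout."""
    sock.settimeout(timeout)
    messages = []
    try:
        for _ in range(count):
            data, _sender = sock.recvfrom(8192)
            messages.append(data.decode("utf-8", errors="replace"))
    except socket.timeout:
        pass  # return what has been received so far
    return messages


# Example call (not executed here):
# for line in receive_syslog(open_listener(), count=10, timeout=30.0):
#     print(line)
```

This receiver only prints what it gets and stores nothing, so it is suitable for a connectivity test but not as a replacement for a syslog server.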
It is also possible to view the system log on the local console. The latest entries from the system log are displayed on the second terminal. You can access this terminal via the key combination CTRL+ALT+F2. All kernel messages are displayed on the third terminal. In the case of hardware problems, you will find the relevant messages here. This terminal can be accessed via the key combination CTRL+ALT+F3. The key combination CTRL+ALT+F1 will take you back to the status screen.
15.2. Available Memory
The system memory of the device is available to your monitoring sites, reduced by the amount of memory which is needed by the system processes of the Checkmk appliance.
To provide a stable system platform a fixed amount of memory is reserved for the mandatory system processes. The exact amount of reserved memory depends on your device configuration:
- Standalone device (no cluster configuration): 100 MB
- Clustered: 300 MB
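The memory left for the monitoring sites follows directly from these two values. A minimal sketch of the arithmetic, in which the total of 8192 MB is a hypothetical example and the function name is an assumption:

```python
# Reserved memory in MB, taken from the list above.
RESERVED_MB = {"standalone": 100, "clustered": 300}


def available_memory_mb(total_mb: int, mode: str) -> int:
    """Memory available to the monitoring sites for a given device mode."""
    return total_mb - RESERVED_MB[mode]


print(available_memory_mb(8192, "standalone"))  # 8092
print(available_memory_mb(8192, "clustered"))   # 7892
```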
If you want to know exactly how much memory is available to your monitoring sites and how much is currently being used, you can monitor your device using Checkmk. After a service discovery the host automatically monitors a service User_Memory which shows you the current and historical values.
If your monitoring instances try to consume more memory than is available, one of the processes of the monitoring sites is automatically killed. This is done by standard mechanisms in the Linux kernel.
16. Service and support
You can get up-to-date support information from our website. You will find the latest version of the documentation here as well as general information which is regularly updated and more detailed than this manual.
You will find the latest firmware versions on our website. You can access this firmware using the access data for your current support contract.
16.4. Hardware support
In the event of hardware failure please contact us by email at email@example.com, or call us on +49 89 99 82 097 - 20. The problem will be handled by the distributor directly and in accordance with the maintenance agreed upon.
16.5. Software support
In the case of a software fault – whether firmware or Checkmk monitoring software – please contact us via your company’s own support address. Support will be provided based on the agreed support contract.