The first part of this blog series focused on showing why you should literally think things 'through to the end' when monitoring applications end to end. Infrastructure monitoring is undoubtedly important and essential, as this monitors the very basis on which applications are built. But only tests performed on a level above the OSI layer 7 are capable of gaining valuable insights into the 'End-User Experience' which is provided by software.

This article is based on an older version of Robotmk (0.19 / Checkmk 1.6), but most parts are still valid. Robotmk is developed for both Linux and Windows. However, before configuring your end-to-end monitoring, you need to install Robot Framework and Selenium on your system. You also need Python 3 installed on your client; you can then easily add Robot Framework, including its libraries, using pip install. For Selenium you also need a webdriver, which acts as a link between Robot and the browser.

After a short explanation of the test structure, this article explains how to install the Robotmk-MKP in Checkmk and how to supply the Checkmk agent for Windows with the Robotmk plug-in using the Bakery. Finally, the article goes into the details of the various options for configuring Checkmk.

TL;DR — this is how Robotmk works

  • The Robotmk plug-in is provided on the client along with the YML control file.
  • The Checkmk agent runs the plug-in at the specified intervals.
  • The Robotmk plug-in determines which robot suites are to be started with which parameters using the YML control file created by the Agent Bakery. These suites are executed sequentially.
  • At the end of each Robot suite, the plug-in passes the XML result file written by Robot to the Checkmk agent within the Robotmk section header <<<robotmk>>>.
  • The robotmk check on the Checkmk server evaluates the <<<robotmk>>> section. The Robot XML is parsed and, using the monitoring rules, checked for runtime violations, etc.
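The hand-over from plug-in to agent described above can be sketched in a few lines of Python. This is only an illustration and not the code of the actual plug-in: the function name build_agent_section is made up here, and the real plug-in may add options to the section header or encode the XML differently.

```python
from pathlib import Path

# Section header as named in the article; the real plug-in may use a variant.
SECTION_HEADER = "<<<robotmk>>>"


def build_agent_section(xml_path: Path) -> str:
    """Return the header line followed by the raw Robot result XML."""
    xml_text = xml_path.read_text(encoding="utf-8")
    return f"{SECTION_HEADER}\n{xml_text.rstrip()}\n"


if __name__ == "__main__":
    import tempfile

    # Hypothetical result file standing in for the output.xml written by Robot.
    sample = Path(tempfile.mkdtemp()) / "output.xml"
    sample.write_text("<robot></robot>\n", encoding="utf-8")
    print(build_agent_section(sample), end="")
```

The agent itself does not interpret the XML; it only forwards the section, which is what leaves the evaluation entirely to the check on the server.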

How the Robot Framework test suite works

The Robotmk plug-in, which will later execute the test, by default expects to find the Robot Framework tests in the agent directory C:\ProgramData\checkmk\agent\robot. Theoretically, .robot files can be placed directly there, but it is recommended to separate them into sub-directories from the beginning. This not only facilitates version control (e.g. with Git), but also provides a better overview.

The robot file seleniumeasy.robot (see listing below) is now copied into the directory.

Saving the robot suite in the agent directory.

seleniumeasy.robot is an intentionally simple test, which nevertheless has a few robot-typical features.

Structure of a test suite

The structure of Robot files can be seen in the code example below:

  • *** Settings *** Import of the Selenium library, plus the actions to run before (Setup) and after (Teardown) the suite.
  • *** Variables *** The declaration of variables
  • *** Test Cases *** A suite consists of at least one test. Robot Framework always tries to run all tests – if one test fails, the next one is started. Therefore, when possible, tests should always be written to run without preconditions. For the common preconditions (here: start of the browser, opening of the page, closing of the pop-up) the keyword Setup – which is named as the Suite Setup in the Settings – is responsible.
  • *** Keywords *** User-defined sequences of existing keywords. Keywords can be nested to any depth in order to abstract the test - indentations of at least 2 spaces indicate the nesting depth.
In the code you can see the structure of the robot file

Final manual check

Before the test runs automatically, a final manual check on the console is useful:

robot .\seleniumeasy.robot
==============================================================================
Seleniumeasy
==============================================================================
Check How Tool1 is | PASS |
Foo is great
------------------------------------------------------------------------------
Check How Tool2 is | PASS |
Bar is great
------------------------------------------------------------------------------
Seleniumeasy | PASS |
2 critical tests, 2 passed, 0 failed
2 tests total, 2 passed, 0 failed
==============================================================================
Output: C:\Users\elabit\Documents\output.xml
Log: C:\Users\elabit\Documents\log.html
Report: C:\Users\elabit\Documents\report.html

Note that the text set by the keyword Set Test Message contains the default variables 'Foo' and 'Bar' respectively. In the following steps you will learn, among other things, how to set these variables from Checkmk.

Installing Robotmk

The next step is to install Robotmk on the Checkmk system. The standard method for this is to install an MKP file, the latest version of which can be downloaded either from the Checkmk Exchange or directly from the project's GitHub page.

The installed Robotmk package.

Deployment of the Robotmk plug-ins

The MKP that has just been installed contains, among other things, the Robotmk plug-in. To install the plug-in on a Windows host, Checkmk Enterprise Edition users can deploy the plug-in using the Agent Bakery, which has a Robotmk rule. The easiest way to find this rule is to search for the term 'robot' under WATO ➳ Host & Service Parameters:

The search function helps to find the Bakery rule for Robotmk.

After creating the new Robotmk rule, under Type of execution you first specify how the Robotmk plug-in should be executed on the target host - you can choose between Async (the default) and Spooldir.

Determining the manner of execution for the plug-ins.

Async causes the Checkmk agent to start the plug-in with a different time interval from the default 1-minute interval for reading the agent. The Execution interval can be set in the field below. Between the 'real' executions of the plug-in, the agent then always delivers the cached result from the last execution to the Checkmk server.
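The caching behavior of the Async mode can be modeled roughly as follows. This is a simplified, synchronous sketch: the class CachedPlugin and its parameters are invented for illustration, and the real agent starts the plug-in in the background rather than blocking while it runs.

```python
import time
from typing import Callable, Optional


class CachedPlugin:
    """Re-run a plug-in only once its execution interval has elapsed."""

    def __init__(self, run: Callable[[], str], interval_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._run = run
        self._interval_s = interval_s
        self._clock = clock
        self._last_run: Optional[float] = None
        self._cached = ""

    def output(self) -> str:
        """Return fresh output if the interval has elapsed, else the cached one."""
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._interval_s:
            self._cached = self._run()  # the 'real' execution
            self._last_run = now
        return self._cached  # delivered on every agent read


if __name__ == "__main__":
    plugin = CachedPlugin(lambda: "<<<robotmk>>>\n<robot/>", interval_s=120)
    print(plugin.output())  # first read triggers an execution
    print(plugin.output())  # second read within 120 s comes from the cache
```

With an execution interval of 120 seconds and the agent read every minute, every second read would be served from the cache in this model.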

At this point it is very important to know that all plug-ins started by Checkmk run in the same user context as the agent; on Windows this is the LOCAL_SYSTEM service account. In the Task Manager you can see that the agent's process is running in session ID 0. In this session, from Windows 7 onwards, no interaction with the desktop is possible by design (even the Allow data exchange between service and desktop checkbox in the service properties does not change this, as it is a legacy setting).

Session ID 0 of the service account does not allow desktop access.

For various reasons, it may thus not be possible or desirable to start a test in this session.

The Spooldir mode is used whenever the Robot Framework has to start the SUT (System under Test) with a special user or when you test a GUI application that requires the desktop for display. In this case, starting the Robotmk plug-in is left to an external mechanism, e.g. the Windows task scheduler or cron/at under Linux. The plug-in then automatically writes the results to the spool directory, and the agent is thus only responsible for 'collecting' the data.
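What 'writing the results to the spool directory' amounts to can be sketched like this. The function names and the file name are hypothetical. The numeric file name prefix follows the Checkmk spool directory convention of encoding a maximum age in seconds, as far as I know it from the Linux agent; whether Robotmk uses it in this way is an assumption.

```python
from pathlib import Path


def spool_file_name(name: str, max_age_s: int) -> str:
    """Prefix the file name with the maximum age in seconds (assumed convention)."""
    return f"{max_age_s}_{name}"


def write_spool_file(spool_dir: Path, name: str, section: str, max_age_s: int) -> Path:
    """Write a complete agent section into the spool directory."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    target = spool_dir / spool_file_name(name, max_age_s)
    target.write_text(section, encoding="utf-8")
    return target


if __name__ == "__main__":
    import tempfile

    # A temporary directory stands in for the agent's real spool directory.
    spool = Path(tempfile.mkdtemp()) / "spool"
    written = write_spool_file(spool, "robotmk", "<<<robotmk>>>\n<robot/>\n", 600)
    print(written.name)
```

The scheduler that calls such a function (Windows task scheduler, cron) then decides in which user context and how often the test runs; the agent only reads the file.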

Scheduling of the 'selenium_test' suite at 2-minute intervals.

The Robot Framework test suites item is optional – if you do not specify any suites, the plug-in simply tries to run all of the suites that can be found in the RobotDir.

However, simply because of the numerous suite-specific settings, one should take a closer look at this option. Here the most important command line options for the Robot Framework can be set – these allow granular control of the execution.

Worth mentioning, for example, is the possibility of starting the suite with certain variables (or reading them in from a file). We make use of this function here – when the variables TOOL1 and TOOL2 are set as indicated below, they will overwrite the default values Foo and Bar in the test.

Setting the Robot variables.

Furthermore, it is possible to limit the number of sub-suites and tests contained in a suite by black-/whitelisting, as well as to change the termination behavior of Robot: exitonfailure ensures that Robot does not always try to execute all of its tests, as described above, but exits the entire suite at the first failure.
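To make these suite-specific settings more tangible, here is a sketch of how they could be translated into a robot command line. The options --variable, --test, --suite and --exitonfailure are regular Robot Framework command line options; the function build_robot_args, its parameters and the variable values are invented for this example and do not reflect how the plug-in is implemented internally.

```python
from typing import Dict, List, Optional, Sequence


def build_robot_args(suite_path: str,
                     variables: Optional[Dict[str, str]] = None,
                     tests: Sequence[str] = (),
                     suites: Sequence[str] = (),
                     exit_on_failure: bool = False) -> List[str]:
    """Assemble the argument list for one robot run."""
    args = ["robot"]
    for name, value in (variables or {}).items():
        args += ["--variable", f"{name}:{value}"]  # overrides the default in the suite
    for test in tests:
        args += ["--test", test]  # whitelist individual tests by name
    for suite in suites:
        args += ["--suite", suite]  # whitelist sub-suites by name
    if exit_on_failure:
        args.append("--exitonfailure")  # stop the whole run at the first failure
    args.append(suite_path)
    return args


if __name__ == "__main__":
    # Hypothetical values; the article sets TOOL1 and TOOL2 via the WATO rule.
    print(" ".join(build_robot_args("selenium_test",
                                    variables={"TOOL1": "Checkmk", "TOOL2": "Robotmk"},
                                    exit_on_failure=True)))
```

Such a list could be passed to subprocess.run unchanged, which avoids quoting problems with test names that contain spaces.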

The option (Non-)Critical test tag might mislead die-hard monitoring admins, because it is not related to the 'CRIT' status in monitoring. In the context of Robot, critical refers to tests that can negatively influence the overall status of a suite, which by default is all of them. If you want to restrict this behavior, you tag individual tests and then enter the assigned tag here. These tests can then fail without affecting the overall result.

With the Piggyback Host option, the robot result can be assigned to a different host. This is useful if you use VMs for the execution of different tests, but want to assign the respective test results to specific hosts in Checkmk.

The option for Agent output encoding does not normally need to be changed, because UTF-8 is the standard encoding of the agent results for Checkmk. If you are bothered by the newlines of the Robot XML when debugging and prefer a compact formatting of the output, you can select base64 here. The third option, zlib-compressed, is particularly interesting for a future release of Robotmk, in which images and videos of error situations will also be transmitted in the agent output. With this option, the data volume shrinks to less than 5 percent.

Encoding the result + log file rotation.
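The effect of the three encodings can be tried out with the Python standard library alone. The mode names below are made up, and wrapping the zlib-compressed bytes in base64 so that they fit into the text-based agent output is an assumption of this sketch, not a statement about Robotmk's actual wire format.

```python
import base64
import zlib


def encode_output(xml_text: str, mode: str) -> str:
    """Encode the Robot XML for transport in the agent output."""
    raw = xml_text.encode("utf-8")
    if mode == "utf8":
        return xml_text  # readable, but keeps all newlines
    if mode == "base64":
        return base64.b64encode(raw).decode("ascii")  # one compact line
    if mode == "zlib":
        return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")
    raise ValueError(f"unknown mode: {mode}")


def decode_output(payload: str, mode: str) -> str:
    """Reverse encode_output on the receiving side."""
    if mode == "utf8":
        return payload
    raw = base64.b64decode(payload)
    if mode == "zlib":
        raw = zlib.decompress(raw)
    return raw.decode("utf-8")


if __name__ == "__main__":
    # Repetitive dummy XML; real results compress differently.
    xml = "<kw name='Click Element'><status status='PASS'/></kw>\n" * 200
    for mode in ("utf8", "base64", "zlib"):
        print(mode, len(encode_output(xml, mode)))
```

How much the compressed variant saves depends heavily on the content; XML with many repeated tag names generally compresses well, while embedded images that are already compressed would not.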

The execution of E2E tests is usually host-specific, so it makes sense to set the Explicit Hosts option at the end of the rule, so that the rule matches as closely as possible.

Important: In order for the Robotmk plug-in to be executed on the remote host, it is necessary to add .py to the list of allowed file extensions by using the additional Checkmk rule Limit script types to execute.

The 'baking' and installation of the new agent follow exactly the steps described in the Checkmk documentation.

After installing the newly created MSI package, the result can be checked on the Windows client:

  • C:\ProgramData\checkmk\agent\plugins should contain robotmk.py
  • C:\ProgramData\checkmk\agent\conf should contain robotmk.yml

Monitoring your E2E tests

The Windows client is now ready for use and most likely has already run the selenium_test suite one or more times in the background since the MSI installation (the Chrome browser controlled via Selenium is capable of running 'headless', i.e. without a GUI).

A manual inventory of the Windows host reveals a new service:

The robot suite selenium_test has been recognized as a new service.

After saving, it is worth taking a closer look at the detailed view of the newly added service:

The detailed view for the newly-discovered Robotmk service.

One can see that:

  • The Robot Framework interprets directories (selenium_test) and robot files (seleniumeasy) as suites ([S]). The 'root' suite is contained in the first output line.
  • The passing of the variables TOOL1 and TOOL2 has worked – these were correctly passed to the Robot Framework test by the Robotmk plug-in and used in the keyword Set Test Message to set the text in the test.
  • The Selenium keywords provide detailed information on which selectors were used to access the elements in the website.

Thresholds and performance data

What would monitoring be without thresholds? The monitoring of runtimes within a test plays a special role in Robotmk.

Robotmk takes advantage of the fact that Robot Framework already records the start and end times of all elements of a suite (i.e. sub-suites, tests, keywords and sub-keywords) in its result XML. And because this XML is transported to the Checkmk server in raw format, the Checkmk admin is free to use this data for monitoring the runtimes: using WATO rules, they only need to define regex patterns for the names of the robot nodes (suites, tests, keywords) and set the desired WARN/CRIT threshold values for them.

The rest is done by the check during recursive processing of the XML results. The administrator also uses the same pattern system to determine for which elements performance graphs will be generated.

As an example, Checkmk should generate a warning if the runtime for the test Check How Tool2 is exceeds 0.3 seconds. It should additionally generate graphs for both tests.

Responsible for this (and for a lot more) is the Robotmk rule for discovered services:

The Robotmk rule for 'discovered services' controls how the Robotmk check interprets results.

The item Runtime thresholds (as well as Perfdata creation) is divided into sections for suites, tests and keywords. The thresholds entered in the screenshot below should be self-explanatory, likewise the two patterns (note that the Perfdata pattern only applies to the second test).

Especially with newly created end-to-end tests, it can be helpful to show the monitored runtime in the output at all times, not only when it has been exceeded. The option Show monitored runtimes also when in OK state controls this.

Include execution date inserts the timestamp from the execution into the line of tests and suites. This is particularly useful when the timestamp from the last result in Checkmk is not sufficient, but it is necessary to determine exactly when the test was in contact with the SUT.

After activating the rule and re-scheduling the check, the service appears in this state:

You can clearly see how the exceeded runtime of the second test case sets that test's status to WARN, as well as the status of its two parent suites.
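The principle behind this evaluation (recursive walk through the result, pattern match on the node name, comparison of the runtime with the thresholds, propagation of the worst state to the parents) can be sketched as follows. This is not the code of the Robotmk check. The XML is a hand-written, heavily simplified stand-in for a real Robot result, the timestamp format is assumed to match it, and the CRIT value of 1.0 seconds is invented; only the WARN threshold of 0.3 seconds comes from the example above.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

OK, WARN, CRIT = 0, 1, 2
STATE_NAMES = {OK: "OK", WARN: "WARN", CRIT: "CRIT"}
TIME_FORMAT = "%Y%m%d %H:%M:%S.%f"  # layout assumed for the sample below

# Simplified stand-in for a Robot result; a real output.xml contains much more.
SAMPLE_XML = """
<robot>
  <suite name="selenium_test">
    <suite name="seleniumeasy">
      <test name="Check How Tool1 is">
        <status status="PASS" starttime="20200101 12:00:00.000" endtime="20200101 12:00:00.200"/>
      </test>
      <test name="Check How Tool2 is">
        <status status="PASS" starttime="20200101 12:00:00.200" endtime="20200101 12:00:00.650"/>
      </test>
      <status status="PASS" starttime="20200101 12:00:00.000" endtime="20200101 12:00:00.650"/>
    </suite>
    <status status="PASS" starttime="20200101 12:00:00.000" endtime="20200101 12:00:00.650"/>
  </suite>
</robot>
"""

# Hypothetical rule: node type -> list of (name pattern, WARN seconds, CRIT seconds).
THRESHOLDS = {"test": [(r"Check How Tool2", 0.3, 1.0)]}


def runtime_seconds(node):
    """Runtime of a node, taken from the start/end times of its own status."""
    status = node.find("status")
    start = datetime.strptime(status.get("starttime"), TIME_FORMAT)
    end = datetime.strptime(status.get("endtime"), TIME_FORMAT)
    return (end - start).total_seconds()


def evaluate(node, thresholds, depth=0):
    """Return (state, output lines) for a node and everything below it."""
    runtime = runtime_seconds(node)
    state = OK
    for pattern, warn, crit in thresholds.get(node.tag, []):
        if re.search(pattern, node.get("name", "")):
            state = CRIT if runtime > crit else WARN if runtime > warn else OK
            break  # first matching pattern wins
    child_lines = []
    for child in node:
        if child.tag in ("suite", "test", "kw"):
            child_state, lines = evaluate(child, thresholds, depth + 1)
            state = max(state, child_state)  # worst state propagates upwards
            child_lines.extend(lines)
    line = f"{'  ' * depth}[{node.tag}] {node.get('name')}: {STATE_NAMES[state]} ({runtime:.2f}s)"
    return state, [line] + child_lines


if __name__ == "__main__":
    root_suite = ET.fromstring(SAMPLE_XML).find("suite")
    overall_state, output_lines = evaluate(root_suite, THRESHOLDS)
    print("\n".join(output_lines))
```

Because the thresholds live in a plain data structure that is applied to the XML afterwards, they can be changed on the server at any time without touching the test or the client.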

The following screenshot shows one of the two graphs for the runtime of the test cases:

Newly created graph for the test cases.

Depending on the use case, it can be useful not only to propagate the non-OK state upwards to tests and suites, but also its cause (the runtime overrun). If the option Show messages of subnodes is activated in the rule that has just been edited, the output becomes even more meaningful:

Showing the sub-node messages.

It is helpful to provide information on the cause of the alarm, e.g. for SMS notifications where only the first line is sent – the on-call service will certainly be grateful!

However, the usefulness of this option must be considered depending on the size of a robot suite and the number of possible simultaneous alarms.

The Service Discovery Level

Imagine that team A only wants to receive the alarms from the Check How Tool1 is test, while team B only wants to receive the alarms from the Check How Tool2 is test. However, you cannot or do not want to split the robot suite to get two separate Checkmk services.

Here Robotmk provides an interesting feature called Service Discovery Level, found as a rule of the same name in the WATO rule management:

The Service Discovery Level rule.

As described above, Robot Framework generates suites from robot files and directories that contain them. Robot files, in turn, are the 'containers' for tests (not nestable), which consist of nested keywords.

Each robot result always consists of exactly one suite, from which Robotmk generates one service by default. The Service Discovery Level can be adjusted by a rule so that the service generation does not start at the highest level '0' (the folder suite), but rather, for example, at the level of the tests (here 2):

The XML node structure of a robot result.

At the same time, this rule also allows newly detected services to be given a prefix during discovery. The somewhat bulky-looking placeholder %SPACE% prevents an intentionally entered space after the prefix from being removed by WATO; it is only converted into a space when the inventory runs and then separates the prefix from the name as desired.

After saving the rule, a new inventory of the services is necessary. The previous single service (generated from level 0) disappears. Instead, Checkmk now recognizes two new services, which represent the tests in the suite.

Thanks to the Service Discovery Level rule, the source code of this robot suite can remain untouched, and the separate alerting of the two services can be handled with Checkmk's on-board tools.
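The idea of the discovery level can be illustrated with a small tree model. The Node class, the function discover_services and the prefix 'E2E' are invented for this sketch; only the level numbering (0 for the folder suite, 2 for the tests) and the %SPACE% placeholder are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One element of a robot result: suite, test or keyword."""
    name: str
    children: List["Node"] = field(default_factory=list)


def discover_services(root: Node, level: int, prefix: str = "") -> List[str]:
    """Return one service name per node found at the given depth."""
    prefix = prefix.replace("%SPACE%", " ")  # placeholder becomes a real space
    current = [root]
    for _ in range(level):
        current = [child for node in current for child in node.children]
    return [f"{prefix}{node.name}" for node in current]


RESULT = Node("selenium_test", [          # level 0: folder suite
    Node("seleniumeasy", [                # level 1: robot file
        Node("Check How Tool1 is"),       # level 2: tests
        Node("Check How Tool2 is"),
    ]),
])

if __name__ == "__main__":
    print(discover_services(RESULT, 0))
    print(discover_services(RESULT, 2, prefix="E2E%SPACE%"))
```

In this model, level 0 yields the single service for the folder suite, while level 2 yields one service per test, each carrying the prefix.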

Documentation

Since it is possible to configure Robotmk completely via WATO, much emphasis was placed on comprehensive context help. If you click on the book icon in the upper right corner of the Checkmk interface, the Robotmk rules display a detailed help text for each input field. Robot Framework-specific topics are linked to the Robot Framework documentation where applicable.

The context help in Robotmk.

Summary

Robotmk bridges the gap between two technologically very different tools, which in many respects fit together perfectly:

  • Their areas of operation in the OSI model complement each other perfectly (see Part 1 of this article: 'End-to-End-Monitoring: How to make sure your applications are running'),
  • they have a modular design and versatile application capabilities,
  • they are Open Source,
  • they have a long-standing market presence and continuity spanning more than 10 years,
  • a growing, strong worldwide community and
  • annual conferences, meetups and workshops, etc.

Status and outlook

In the meantime, the author is working as Product Manager for Synthetic Monitoring at Checkmk to ensure that Robotmk is fully integrated as an enterprise solution in Checkmk 2.3. 

New features in Robotmk v2 include parallel execution of test suites and fully automated deployment of Python environments.

Conclusion

The fact that such an integration of results from tests with Robot Framework into a monitoring system is possible at all is due to the fact that Checkmk strictly separates the collection of data from its evaluation. It is not the plug-in on the client that is parameterized, but the server-side check that evaluates the collected data.

Robotmk makes ample use of the possibility of influencing the underlying Python code with its specially-created WATO masks. Incidentally, this is also one of the reasons why compatibility with other monitoring systems (Naemon, Icinga2, Zabbix, etc.) is excluded.

The Robotmk check only needs the XMLs from the results for its evaluation; the robot tests themselves are monitoring-agnostic. This results in a double benefit – existing robot tests (e.g. from a CI pipeline) can be used in parallel in a monitoring without adaptation, and newly-created application tests can also be used outside of Checkmk.

About the author:
Simon is an active member of our Checkmk community and introduces his Checkmk plug-in Robotmk in our blog. Robotmk is based on Robot Framework and enables end-to-end monitoring with Checkmk. Simon is CEO of Elabit and a specialist in IT topics such as monitoring (Checkmk), configuration management (Ansible), RPA (Robotic Process Automation) and test automation (Robot Framework, Robotmk).