At our Checkmk Conference #6, our chief developer Lars Michelsen presented the projects our development team is currently working on, and the improvements we have planned for Version 2.0. In this post we have selected some of the most important points from his talk to share with you.
An important component in the further development of Checkmk is improving the user experience. The UX is so important to us that our founder Mathias Kettner dedicated a detailed lecture to it at the conference and collected your feedback on various tech previews. We will cover the subject of UX and Checkmk in a separate blog post. Here, however, we will focus on the points that Lars touched on in his lecture:
Among other things, the Raw Edition will receive the modernized graphing that is already in use in the Enterprise Edition. Our goal is a consistent look and feel across all editions. This means that the HTML5 graphs from the Enterprise Edition will completely replace PNP4Nagios, previously used for visualizations in the Raw Edition – though not with the full range of functions, as Lars emphasized. Individual and combined graphs, for example, remain available only in the Enterprise Edition of Checkmk. The technical conversion behind this also enables Grafana integration for the Raw Edition.
We also have some innovations on the agenda for Version 2.0 when it comes to analyzing historical data. Some of these functions are already available in Version 1.6, for example the ability to forecast the future development of metrics. This can help you identify problems early and anticipate capacity requirements. The forecasts can also be embedded in reports.
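To illustrate the idea behind such forecasting – Checkmk's built-in prediction offers more sophisticated models; this is merely a conceptual sketch with made-up numbers – a linear trend extrapolated from historical samples already allows a rough capacity estimate:

```python
import numpy as np

def forecast_linear(timestamps, values, horizon_seconds):
    """Fit a linear trend to historical metric samples and
    extrapolate it into the future.

    timestamps:      UNIX timestamps of the samples
    values:          measured metric values (e.g. disk usage in GB)
    horizon_seconds: how far past the last sample to predict
    """
    t = np.asarray(timestamps, dtype=float)
    v = np.asarray(values, dtype=float)
    # Least-squares fit of v = slope * t + intercept
    slope, intercept = np.polyfit(t, v, deg=1)
    future_t = t[-1] + horizon_seconds
    return slope * future_t + intercept

# Example: disk usage sampled once per day over five days
days = [0, 86400, 172800, 259200, 345600]
usage_gb = [100, 104, 108, 113, 117]
# Roughly 4 GB growth per day -> about 147 GB one week after the last sample
print(forecast_linear(days, usage_gb, 7 * 86400))
```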
Improved reporting and context-sensitive dashboards
We are also expanding the reporting options in Checkmk with Version 2.0. PDF reports will be able to optionally contain a table of contents and a cover page. In addition, it will be possible to define a specific time period for each individual element – until now, all elements of a report had to share the same time period.
There will also be a lot going on with dashboards, such as new forms of visualization (dashlets) and easier dashboard creation. Dashboards are now context-sensitive and can present specific facts, such as all information for host X. With Version 2.0 we also want to ship dashboards for specific applications, for example for monitoring vSphere, Linux, or Windows with Checkmk.
Our development team has also been working on a so-called 'tag usage' overview. This view lets you evaluate which tags and tag groups are used in which folders, hosts, and rules.
Lars also announced further developments in the area of cloud and containers. Since the introduction of AWS monitoring with Release 1.6, we have added further checks to Checkmk, for example for AWS Glacier and DynamoDB. According to Lars, these arose primarily from user requests. Compared with Microsoft Azure, we have lately spent more time improving AWS monitoring. For Azure, however, we have created one important function: monitoring the Active Directory integration, which is relevant for the majority of Azure environments.
According to Lars, the monitoring of Kubernetes is already working well. We have extended the functionality of our Kubernetes monitoring with additional checks – such as a check for Ingress objects – and an inventory plug-in. Checkmk is now also able to determine inventory data for jobs in Kubernetes clusters and to evaluate their status via a check. The integration of Prometheus also plays a role in monitoring Kubernetes environments with Checkmk – we will provide you with more detailed information on this soon.
An essential part of the cloud and container integration in Checkmk is the Dynamic Configuration Daemon (DCD). It ensures that Checkmk picks up configuration changes automatically. In this way, Checkmk detects, for example, volatile elements in dynamic infrastructures and removes them automatically once they expire. Here, too, our development department has optimized the processing of piggyback data in Checkmk – for example, the lifespan of piggyback data can now be freely configured. It is also possible to extend the DCD with your own connectors to attach your own data sources, such as a CMDB. To make this even easier, we plan to document and improve the DCD plug-in API.
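As a reminder of what piggyback data looks like: an agent or special agent can embed sections destined for other hosts in its output by wrapping them in four angle brackets. A shortened, illustrative example (the host name and section contents are made up):

```
<<<check_mk>>>
Version: 1.6.0p12
AgentOS: linux
<<<<my-kubernetes-pod>>>>
<<<uptime>>>
1234567
<<<<>>>>
```

Everything between `<<<<my-kubernetes-pod>>>>` and the closing `<<<<>>>>` is assigned to the host my-kubernetes-pod instead of the host that delivered the output – and it is precisely the lifespan of such data that can now be freely configured.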
More options for analyzing network problems
In addition to monitoring servers, network monitoring also plays a major role for our Checkmk users. Here, too, we at tribe29 want to offer even better options for analyzing network problems in the future. With Version 2.0, Checkmk will offer the possibility to analyze network flows, thanks to an integration with ntop.
In addition, we have already presented a number of new checks – for example, for monitoring VPN connections, which, according to Lars, is increasingly in demand due to the current developments around Covid-19 and the increased practice of working from home. We have also recently added some new check and inventory plug-ins for various manufacturers.
Lars also explained a lot about automation within Checkmk at the conference: In the future, the Agent Bakery should handle segmented networks better in order to save bandwidth. The Distributed Agent Bakery implements this function. What is new here is that agent updaters no longer all communicate with the central Agent Bakery, but first talk to their respective local site. Agent packages that do not yet exist on the remote site are fetched from the central site and, where it makes sense, cached locally. In addition, we have now implemented bakelets as a general concept to support the automatic distribution of customized files. Until now, customizations of installed plug-ins could only be distributed via the bakery if a dedicated bakelet supported them. In addition to the already available notification plug-in for Cisco Webex Teams, another plug-in is in the pipeline for Microsoft Teams.
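The local-first lookup that the Distributed Agent Bakery performs could conceptually work like the following sketch. This is purely illustrative logic under our own assumptions – the cache location, URL scheme, and function names are made up, not Checkmk's actual implementation:

```python
import urllib.request
from pathlib import Path

CACHE_DIR = Path("/var/cache/agent_bakery")  # hypothetical cache location

def fetch_agent_package(agent_hash: str, local_site: str, central_site: str) -> bytes:
    """Fetch a baked agent package, preferring the local site.

    1. Serve from the local cache if present.
    2. Otherwise ask the local (remote) site.
    3. Fall back to the central bakery and cache the result.
    """
    cached = CACHE_DIR / agent_hash
    if cached.exists():
        return cached.read_bytes()

    for site_url in (local_site, central_site):
        try:
            # Hypothetical download URL; the real endpoint differs.
            with urllib.request.urlopen(f"{site_url}/agents/{agent_hash}") as resp:
                package = resp.read()
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            cached.write_bytes(package)  # cache for the next updater
            return package
        except OSError:
            continue  # site unreachable or package unknown, try the next one

    raise RuntimeError(f"Agent package {agent_hash} not available on any site")
```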
Lars also broached another major project of our development team this year – replacing the previous Web API with a REST API. The technical foundation is already in place, and we now want to fill it with life. Checkmk developer Christoph Rauch gave a deeper insight into the new REST API on the second day of the conference.
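To give a first impression, a request against the new REST API could look roughly like the following sketch. Server, site name, and credentials are placeholders, and since the API is still being fleshed out, the endpoint path shown here is an assumption based on the current state:

```python
import requests

SERVER = "https://monitoring.example.com"
SITE = "mysite"
API_URL = f"{SERVER}/{SITE}/check_mk/api/1.0"

session = requests.session()
# The REST API authenticates with an automation user and its secret.
session.headers["Authorization"] = "Bearer automation myautomationsecret"
session.headers["Accept"] = "application/json"

# List all configured hosts.
resp = session.get(f"{API_URL}/domain-types/host_config/collections/all")
resp.raise_for_status()
for host in resp.json()["value"]:
    print(host["id"])
```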
We are also working on a revision of the previous Check API. Since, according to Lars, this is the most important plug-in API of the monitoring software, the first step was to develop a good concept. Developer Moritz Kiemer explained what the new Check API should look like, also on the second day of the event. We will soon present more information on this and the REST API here in the blog.
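To sketch the direction only – the final API may well differ, and we will cover the details in the announced blog post – a minimal check plug-in under the new API could look something like this (the plug-in name and section layout are invented for illustration):

```python
# Minimal sketch of a check plug-in under the new Check API.
# Assumes the agent section has already been parsed into
# (item, usage) pairs by a parse function omitted here.
from .agent_based_api.v1 import register, Result, Service, State

def discover_example(section):
    # One service per item found in the parsed agent section.
    for item, _usage in section:
        yield Service(item=item)

def check_example(item, section):
    for name, usage in section:
        if name == item:
            state = State.OK if usage < 90 else State.CRIT
            yield Result(state=state, summary=f"Usage: {usage}%")
            return

register.check_plugin(
    name="example_usage",
    service_name="Example usage %s",
    discovery_function=discover_example,
    check_function=check_example,
)
```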
Better performance in Checkmk
Since we have numerous enterprise customers worldwide, and more than half of the companies listed on the DAX rely on Checkmk to monitor their IT infrastructure, the scalability of the solution is an absolute must for us. At the moment, so-called helper processes both collect the raw monitoring data and compute the check results. These RAM-intensive helpers often have to wait for network I/O, which limits their scalability.
In order to reduce the dependency on network I/O, our developers have now split the helper process in two. From Version 2.0, fetcher processes will collect the raw monitoring data. According to Lars, the fetchers need only a small amount of RAM, can cheaply wait for network I/O, and support a timeout window per query. Large numbers of fetcher processes can therefore run in parallel on the system without consuming as much RAM as before, Lars explained.
Checker processes then evaluate the collected data. They correspond to today's helper processes, only without the network communication: they are CPU-bound and should not consume more RAM than before. Since they are no longer blocked in waiting loops, they can process the collected raw data immediately. By splitting the helper process, we enable companies to scale their Checkmk monitoring further, because the software uses the existing hardware resources much more efficiently and with better performance.
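Conceptually, the split resembles a classic producer-consumer pattern: many lightweight, I/O-bound fetchers feed a few CPU-bound checkers. The following toy sketch illustrates the idea only – it is in no way Checkmk's actual code:

```python
import queue
import threading

raw_data_queue = queue.Queue()

def fetcher(host):
    """I/O-bound: waits for the host to answer, needs little RAM."""
    # Placeholder for the real network query (agent, SNMP, API, ...).
    raw = f"agent output of {host}"
    raw_data_queue.put((host, raw))

def checker():
    """CPU-bound: parses the raw data and computes check results."""
    while True:
        host, raw = raw_data_queue.get()
        print(f"{host}: OK ({len(raw)} bytes processed)")
        raw_data_queue.task_done()

# Many lightweight fetchers can wait in parallel ...
for host in ["web01", "db01", "mail01"]:
    threading.Thread(target=fetcher, args=(host,)).start()

# ... while a single checker consumes the results as they arrive.
threading.Thread(target=checker, daemon=True).start()
raw_data_queue.join()
```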
In addition to scaling, we have also worked on the 'activate changes' workflow. Until now, with each activation Checkmk has packed the entire configuration into a snapshot, synchronized it, and unpacked it again on the target monitoring sites. In the future we will accelerate this process by having Checkmk synchronize the configuration incrementally when changes are activated. Nothing changes on the surface for the user – 'under the hood' Checkmk only synchronizes a fraction of the previous data, since it transmits only the actual changes made.
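The basic idea behind such an incremental synchronization can be illustrated with content hashes: only files whose hash differs between the central and the remote site need to be transferred. Again, this is a conceptual sketch under our own assumptions, not Checkmk's implementation:

```python
import hashlib
from pathlib import Path

def config_hashes(config_dir: Path) -> dict:
    """Map every config file to the SHA-256 hash of its content."""
    return {
        path.relative_to(config_dir): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in config_dir.rglob("*")
        if path.is_file()
    }

def files_to_sync(central: dict, remote: dict) -> list:
    """Only files that are new or changed on the central site are transferred."""
    return [path for path, digest in central.items() if remote.get(path) != digest]
```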
Finally, Lars gave an insight into the current status of the Python 3 migration of Checkmk. Our developers are well on track to replace the no-longer-supported Python 2.7. However, since Python 3 is not backwards compatible, they have had to port the more than 760,000 lines of Checkmk code from Python 2.7 to Python 3. "So it is not enough to change five lines of code..." joked Lars about the workload of the migration. The Checkmk developers have already migrated the Checkmk base, the check plug-ins, and many other modules. In addition, all new special agents are already being written in Python 3, while older agents still have to be migrated. Our developers have also completed the first preparations for the GUI. We are confident that the migration will be completed with the release of Checkmk 2.0.
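To give a flavor of the kinds of changes such a port involves, here are a few generic examples (not actual Checkmk code) of Python 2.7 idioms and their Python 3 counterparts:

```python
# Typical changes when porting from Python 2.7 to Python 3:

# print is a function now (Python 2: `print "Checking", host`).
host = "web01"
print("Checking", host)

# Integer division must be explicit (Python 2: 5 / 2 == 2).
assert 5 / 2 == 2.5
assert 5 // 2 == 2

# bytes and str are strictly separated (Python 2 mixed them freely).
raw = b"agent output"
text = raw.decode("utf-8")

# Exception syntax changed (Python 2: `except IOError, e:`).
try:
    open("/nonexistent")
except OSError as e:
    print("failed:", e)
```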
Soon – in our next blog entry – we will explain in detail how we envision the integration of the ntop network flow analysis.