Werk #16565: Introduce distributed tracing of Checkmk

Component Core & setup
Title Introduce distributed tracing of Checkmk
Date Aug 9, 2024
Level Prominent Change
Class New Feature
Compatibility Compatible - no manual interaction needed
Checkmk versions & editions
2.4.0b1 Checkmk Raw (CRE), Checkmk Enterprise (CEE), Checkmk Cloud (CCE), Checkmk MSP (CME)

With this change, we aim to improve our capabilities as Checkmk developers to understand the performance and behavior of Checkmk in a distributed environment better. This will help us to identify issues and bottlenecks and improve the performance of Checkmk in the future. Besides logging, metrics and profiling, it is another tool which gives us great insights.

This change is not meant as a new user feature, e.g. to do distributed tracing of another software with Checkmk. Nevertheless, we feel it makes sense to document it here so that you are aware of it when you see the new related configuration options.

Goal & requirements

Developers can get easier and better insights into complex and large-scale production and test environments of Checkmk.

  1. The solution shall be usable in a production environment without major performance impact.
  2. No extra complex setup, such as installation of software in addition to Checkmk, shall be required for the traces to be captured and visualized.
  3. The solution shall capture execution steps and collect timing information of them.
  4. For each invocation of these program flows a supporter can visualize traces including the nested spans across components.
  5. The data can be visualized right in the production environment.
  6. No data is sent to any external system except it is explicitly configured by the user.

High level concept

We use OpenTelemetry as the underlying technology for distributed tracing. The Checkmk applications are instrumented to send traces. Those traces are sent to a collector, which is running in every Checkmk site or just in the central site, in case of a distributed Checkmk environment. This collector is Jaeger, which is a popular open-source distributed tracing tool. The traces are stored in memory and can be viewed in the Jaeger UI.

Configuration

Sending and receiving traces is configured through omd config. The two options TRACE_RECEIVE and TRACE_SEND are used to enable or disable the receiving and sending of traces.

The TRACE_RECEIVE option will tell the Checkmk applications to send traces via OTLP to a collector, which can be the local Jaeger instance or any other OTLP collector of your choices.

The option TRACE_SEND enables the sites local Jaeger instance. The site apache enables the exposure of the Jaeger UI via http://checkmkhost/site_id/jaeger/.

Example

If you have just one site and want to enable tracing, set TRACE_RECEIVE to on. Secondly set TRACE_SEND to on and set TRACE_SEND_TARGET to local_site. Then start your site again. You can now access the Jaeger UI via the URL http://checkmkhost/site_id/jaeger/. After a few seconds you should see the first traces in the UI.

In a distributed setup, you would enable TRACE_RECEIVE on the central site and TRACE_SEND on all other sites and point TRACE_SEND_TARGET to the central site's collector (TRACE_RECEIVE_ADDRESS:TRACE_RECEIVE_PORT).

Even if tracing is not much overhead locally, and of course some overhead in the distributed setup, because the traces need to be sent to the central site, we recommend enabling tracing only when you need it and disable it again when you don't need it anymore.

To the list of all Werks