Werk #16565: Introduce distributed tracing of Checkmk
Komponente | Core & setup | ||
Titel | Introduce distributed tracing of Checkmk | ||
Datum | 09.08.2024 | ||
Level | Bedeutende Änderung | ||
Klasse | Neues Feature | ||
Kompatibilität | Kompatibel - benötigt kein manuelles Eingreifen | ||
Checkmk versions & editions |
|
With this change, we aim to improve our capabilities as Checkmk developers to understand the performance and behavior of Checkmk in a distributed environment better. This will help us to identify issues and bottlenecks and improve the performance of Checkmk in the future. Besides logging, metrics and profiling, it is another tool which gives us great insights.
This change is not meant as a new user feature, e.g. to do distributed tracing of another software with Checkmk. Nevertheless, we feel it makes sense to document it here so that you are aware of it when you see the new related configuration options.
Goal & requirements
Developers can get easier and better insights into complex and large-scale production and test environments of Checkmk.
- The solution shall be usable in a production environment without major performance impact.
- No extra complex setup, such as installation of software in addition to Checkmk, shall be required for the traces to be captured and visualized.
- The solution shall capture execution steps and collect timing information of them.
- For each invocation of these program flows a supporter can visualize traces including the nested spans across components.
- The data can be visualized right in the production environment.
- No data is sent to any external system except it is explicitly configured by the user.
High level concept
We use OpenTelemetry as the underlying technology for distributed tracing. The Checkmk applications are instrumented to send traces. Those traces are sent to a collector, which is running in every Checkmk site or just in the central site, in case of a distributed Checkmk environment. This collector is Jaeger, which is a popular open-source distributed tracing tool. The traces are stored in memory and can be viewed in the Jaeger UI.
Configuration
Sending and receiving traces is configured through omd config
. The two options
TRACE_RECEIVE
and TRACE_SEND
are used to enable or disable the receiving and
sending of traces.
The TRACE_RECEIVE
option will tell the Checkmk applications to send traces via
OTLP to a collector, which can be the local Jaeger instance or any other OTLP
collector of your choices.
The option TRACE_SEND
enables the sites local Jaeger instance. The site apache
enables the exposure of the Jaeger UI via http://checkmkhost/site_id/jaeger/
.
Example
If you have just one site and want to enable tracing, set TRACE_RECEIVE
to on
.
Secondly set TRACE_SEND
to on and set TRACE_SEND_TARGET
to local_site
.
Then start your site again. You can now access the Jaeger UI via the URL
http://checkmkhost/site_id/jaeger/
. After a few seconds you should see the
first traces in the UI.
In a distributed setup, you would enable TRACE_RECEIVE
on the central site and
TRACE_SEND
on all other sites and point TRACE_SEND_TARGET
to the central
site's collector (TRACE_RECEIVE_ADDRESS:TRACE_RECEIVE_PORT
).
Even if tracing is not much overhead locally, and of course some overhead in the distributed setup, because the traces need to be sent to the central site, we recommend enabling tracing only when you need it and disable it again when you don't need it anymore.