Werk #14451: mk_logwatch: lost log messages

Component Checks & agents
Title mk_logwatch: lost log messages
Date Jul 12, 2022
Level Trivial Change
Class Bug Fix
Compatibility Compatible - no manual interaction needed
Checkmk versions & editions
2.2.0b1 (not yet released) Checkmk Raw (CRE), Checkmk Enterprise (CEE), Checkmk Cloud (CCE), Checkmk MSP (CME)
2.1.0p9 Checkmk Raw (CRE), Checkmk Enterprise (CEE), Checkmk MSP (CME)

This werk fixes the occasional loss of messages reported by the agent plugin mk_logwatch.

Sometimes messages could be "stolen" by (for instance) a manual execution of cmk -d MyHost.

Previously:

Every time the logwatch plugin was executed, it gathered all relevant log messages that had accumulated since its last execution. Those messages were then reported as agent output during this execution of the plugin, and never again.

The problem: Log messages are only processed by the checking engine of Checkmk. If the plugin is executed in a different context (such as the HW/SW inventory or service discovery), the reported log messages are lost. To mitigate this problem, the default caching parameters of a site are carefully calibrated so that this should occur only rarely. However, a manual execution of cmk -d MyHost, for instance, can always result in data loss.

From now on:

As before, every time the plugin is executed it gathers all relevant messages that have accumulated since its last execution. It now puts these messages in a bundle and stores that bundle on disk. Then all bundles that have been created within a configurable period, the retention period, are output and sent to the monitoring site. The monitoring site keeps track of the bundles and only processes the ones it has not seen before.
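The mechanism can be illustrated with a minimal Python sketch. This is not the actual mk_logwatch code: the names Bundle, AgentSide and SiteSide are hypothetical, bundles are kept in memory rather than on disk, and the bundle ID scheme and the exact retention comparison (age strictly below the retention period) are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Bundle:
    """Messages gathered during one plugin run (hypothetical structure)."""
    bundle_id: str
    created_at: float
    lines: list


class AgentSide:
    """Stands in for the plugin; keeps bundles in memory instead of on disk."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.bundles = []

    def run(self, new_lines, now):
        # Store the freshly gathered messages as a new bundle.
        self.bundles.append(Bundle(uuid.uuid4().hex, now, list(new_lines)))
        # Keep only bundles younger than the retention period (assumed rule).
        self.bundles = [
            b for b in self.bundles
            if now - b.created_at < self.retention_seconds
        ]
        # Every retained bundle is part of the agent output.
        return list(self.bundles)


class SiteSide:
    """Stands in for the checking engine; remembers which bundles it has seen."""

    def __init__(self):
        self.seen = set()

    def process(self, bundles):
        fresh = []
        for bundle in bundles:
            if bundle.bundle_id in self.seen:
                continue  # already processed during an earlier check
            self.seen.add(bundle.bundle_id)
            fresh.extend(bundle.lines)
        return fresh


agent = AgentSide(retention_seconds=90)
site = SiteSide()

# A manual fetch (think of cmk -d MyHost): the output never reaches the checker.
agent.run(["error A"], now=0)

# The next regular check still receives the earlier bundle.
processed = site.process(agent.run(["error B"], now=60))
print(processed)
```

In this sketch the message gathered during the manual fetch is still delivered to the next regular check, because its bundle is younger than the retention period and has not yet been seen by the site.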

You can configure the retention period using the ruleset Text logfiles (Linux, Solaris, Windows).

The retention period should be at least as long as the check interval of the host, which drastically decreases the risk of data loss. A value that is much smaller than the host's check interval will not automatically lead to data loss, though. It will just not help to prevent it.

Note that setting the retention period to N times the host's check interval will result in every bundle of messages being fetched N times (during regular operation). As a result, the amount of transmitted data increases. Those bundles are also stored on disk on the monitored host, taking up as much space as the transmitted data. These are the two resources that are being traded for a reduced risk of data loss.
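The trade-off can be estimated with a rough calculation. The following Python sketch is an approximation under stated assumptions, not a formula taken from Checkmk: it assumes one plugin run per check interval, bundles of roughly equal size, and that a bundle is output as long as its age is below the retention period. The function names are hypothetical.

```python
import math


def fetches_per_bundle(retention_seconds, check_interval_seconds):
    """Approximate how often one bundle is output during regular operation."""
    # A bundle is output at ages 0, I, 2I, ... while the age stays below
    # the retention period; the fresh bundle is always output at least once.
    return max(1, math.ceil(retention_seconds / check_interval_seconds))


def estimated_overhead(retention_seconds, check_interval_seconds, bundle_bytes):
    """Estimate transmitted data per fetch and disk usage on the monitored host."""
    n = fetches_per_bundle(retention_seconds, check_interval_seconds)
    return {
        "fetches_per_bundle": n,
        "bytes_per_fetch": n * bundle_bytes,
        "bytes_on_disk": n * bundle_bytes,
    }


# Retention period of three check intervals, bundles of about 1000 bytes each.
estimate = estimated_overhead(180, 60, 1000)
print(estimate)
```

With a retention period of three check intervals, each bundle is fetched three times, and both the data per fetch and the disk usage are about three times the size of a single bundle.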
