Werk #14451: mk_logwatch: lost log messages
Component | Checks & agents
Title | mk_logwatch: lost log messages
Date | Jul 12, 2022
Level | Trivial Change
Class | Bug Fix
Compatibility | Compatible - no manual interaction needed
Checkmk versions & editions |
This werk fixes the occasional loss of messages reported by the agent plugin mk_logwatch.
Sometimes messages could be "stolen" by (for instance) a manual execution of cmk -d MyHost.
Previously:
Every time the logwatch plugin was executed, it gathered all relevant log messages that had accumulated since its last execution. Those messages were then reported as agent output during this execution of the plugin, and never again.
The problem: Log messages are only dealt with by the checking engine of Checkmk. If the plugin is executed in a different context (such as the HW/SW inventory or service discovery), the reported log messages are lost. To mitigate this problem, the default caching parameters of a site are carefully calibrated so that this should occur only rarely. However, a manual execution of cmk -d MyHost, for instance, can always result in data loss.
From now on:
As before, every time the plugin is executed it gathers all relevant messages since its last execution. It now puts these messages in a bundle and stores that bundle on disk. Then all bundles that have been created within a configurable period, the retention period, are output and sent to the monitoring site. The monitoring site keeps track of the bundles and only processes the ones it has not seen before.
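The mechanism described above can be illustrated with a minimal Python sketch. This is not the actual mk_logwatch code: the class names AgentSide and SiteSide, the in-memory list standing in for the on-disk store, the random bundle IDs, and the pruning of bundles older than the retention period are all assumptions made for illustration.

```python
import uuid


class AgentSide:
    """Hypothetical stand-in for the plugin and its bundle store."""

    def __init__(self, retention_period):
        self.retention_period = retention_period  # seconds
        self.bundles = []  # (bundle_id, created_at, messages); on disk in reality

    def run(self, new_messages, now):
        # Bundle the messages gathered since the last execution.
        if new_messages:
            self.bundles.append((uuid.uuid4().hex, now, list(new_messages)))
        # Assumption: bundles older than the retention period are discarded.
        self.bundles = [
            b for b in self.bundles if now - b[1] <= self.retention_period
        ]
        # Output every bundle that is still within the retention period.
        return list(self.bundles)


class SiteSide:
    """Hypothetical stand-in for the monitoring site's bookkeeping."""

    def __init__(self):
        self.seen = set()

    def process(self, bundles):
        # Only process bundles that have not been seen before.
        fresh = []
        for bundle_id, _created_at, messages in bundles:
            if bundle_id in self.seen:
                continue
            self.seen.add(bundle_id)
            fresh.extend(messages)
        return fresh


agent = AgentSide(retention_period=60)
site = SiteSide()

agent.run(["error A"], now=0)            # e.g. a manual run; output is discarded
regular = agent.run(["error B"], now=30)  # regular check still carries "error A"
processed = site.process(regular)
repeated = site.process(agent.run([], now=50))  # same bundles again, nothing new
```

In this sketch, the output of the first run never reaches the site, yet "error A" is still processed by the following regular check because its bundle is re-sent while it is within the retention period.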
You can configure the retention period using the ruleset Text logfiles (Linux, Solaris, Windows).
The retention period should be at least as long as the check interval of the host in order to drastically decrease the risk of data loss. A value that is much smaller than the host's check interval will not automatically lead to data loss, though. It will just not help prevent it.
Note that setting the retention period to N times the host's check interval will result in every bundle of messages being fetched N times (during regular operation). As a result, the amount of transmitted data increases accordingly. Those bundles are also stored on disk on the monitored host, taking up as much space as the transmitted data. These are the two resources that are being traded for a reduced risk of data loss.
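The trade-off can be estimated with a few lines of Python. This is a rough sketch only: the function estimate_overhead and its parameters are hypothetical, it assumes a constant log volume per check interval, and it assumes that every bundle is sent at least once.

```python
def estimate_overhead(check_interval_s, retention_period_s, log_bytes_per_interval):
    """Rough estimate of the cost of a given retention period.

    Returns (fetches_per_bundle, transmitted_bytes_per_check, disk_bytes).
    """
    n = retention_period_s / check_interval_s
    # Assumption: a bundle is output at least once, by the run that creates it.
    fetches_per_bundle = max(1.0, n)
    # Each regular check transmits all bundles still within the retention period.
    transmitted_bytes_per_check = fetches_per_bundle * log_bytes_per_interval
    # The same bundles are held on disk on the monitored host.
    disk_bytes = transmitted_bytes_per_check
    return fetches_per_bundle, transmitted_bytes_per_check, disk_bytes


# Check interval 60 s, retention period 180 s, about 1000 bytes of logs per interval.
fetches, transmitted, disk = estimate_overhead(60, 180, 1000)
```

With these example numbers, N is 3, so each bundle is fetched about three times and roughly 3000 bytes are transmitted per check and held on disk.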