
Werk #19986: Redfish: retry transient Systems failures and keep previous data instead of dropping services

Component: Checks & agents
Title: Redfish: retry transient Systems failures and keep previous data instead of dropping services
Date: Jun 12, 2026
Level: Trivial Change
Class: Bug Fix
Compatibility: Compatible - no manual interaction needed
Checkmk versions & editions:
3.0.0b1 (not yet released): Checkmk Community, Checkmk Pro, Checkmk Ultimate, Checkmk Cloud, Checkmk Ultimate MT
2.5.0p9 (not yet released): Checkmk Community, Checkmk Pro, Checkmk Ultimate, Checkmk Cloud, Checkmk Ultimate MT
2.4.0p34 (not yet released): Checkmk Community, Checkmk Pro, Checkmk Ultimate, Checkmk Cloud, Checkmk Ultimate MT

Some management controllers (notably Dell iDRAC) intermittently answer the central GET /redfish/v1/Systems request with a transient error such as HTTP 503 or 404, while all other endpoints respond normally. Because that request is the parent of every system-scoped section, a single such failure made the agent emit a successful but incomplete result. The affected services (CPUs, memory, storage, drives, volumes and network interfaces) then briefly vanished from the monitoring with the message "Item not found in monitoring data".

The special agent now retries the system-data request on a transient failure before giving up. By default it retries up to 3 times with a 2-second delay between attempts. This is configurable per rule under Redfish Compatible Management Controller → Retry fetching system data; set the number of retries to 0 to disable retrying.
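The retry behaviour described above can be pictured with a minimal sketch. This is not the special agent's actual code: `FetchError`, `TRANSIENT_STATUS` and `fetch_with_retries` are hypothetical names, and the set of status codes treated as transient is assumed from the two examples named in this Werk. Only the defaults (3 retries, 2 seconds) are taken from the text.

```python
import time
from collections.abc import Callable
from typing import Any

# Assumption: only the two status codes named in this Werk count as transient.
TRANSIENT_STATUS = frozenset({404, 503})


class FetchError(Exception):
    """Hypothetical error carrying the HTTP status of a failed request."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_with_retries(
    fetch: Callable[[], Any],
    retries: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch(); on a transient failure, retry up to `retries` more times.

    With retries=0 the request is attempted exactly once.
    Non-transient errors are re-raised immediately.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except FetchError as exc:
            if exc.status not in TRANSIENT_STATUS or attempt >= retries:
                raise
            attempt += 1
            sleep(delay)  # fixed delay between attempts, no backoff
```

In this sketch the default settings allow at most four requests in total (the initial attempt plus three retries), so a controller that stays unavailable delays the run by about six seconds before the error is raised.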

If the request still cannot be fetched after the configured retries, the agent now aborts the run instead of publishing an incomplete dataset. As a result, the monitoring keeps the previously collected data and the system services no longer disappear during a short controller hiccup. The Check_MK service of the affected host turns CRITICAL for that interval instead.
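The difference between aborting and publishing a partial result can be sketched as follows. This is an illustration under stated assumptions, not the agent's implementation: `run_agent`, `fetch_systems` and `render_sections` are hypothetical names, and the non-zero return value stands in for whatever mechanism the real agent uses to signal a failed run.

```python
import sys
from collections.abc import Callable
from typing import Any


def run_agent(
    fetch_systems: Callable[[], Any],
    render_sections: Callable[[Any], str],
    emit: Callable[[str], Any] = sys.stdout.write,
    warn: Callable[[str], Any] = sys.stderr.write,
) -> int:
    """Emit agent output only if the Systems request succeeded.

    Returns 0 on success. On failure nothing is written to the regular
    output, so no incomplete dataset is published; the error goes to the
    warning channel and a non-zero code is returned (assumed convention).
    """
    try:
        systems = fetch_systems()
    except Exception as exc:
        warn(f"Cannot fetch /redfish/v1/Systems: {exc}\n")
        return 1
    emit(render_sections(systems))
    return 0
```

The point of the design is that a failed run produces no section output at all, rather than output with the system-scoped sections missing, so there is nothing that could overwrite the previously collected data.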

Per-section resilience is unchanged: a failure of an individual section (e.g. a single drive) still only affects that section.

No user action is required. To tune the behaviour, adjust the new rule option.
