Werk #19986: Redfish: retry transient Systems failures and keep previous data instead of dropping services
| Component | Checks & agents |
| Title | Redfish: retry transient Systems failures and keep previous data instead of dropping services |
| Date | Jun 12, 2026 |
| Level | Trivial Change |
| Class | Bug Fix |
| Compatibility | Compatible - no manual interaction needed |
| Checkmk versions & editions | |
Some management controllers (notably Dell iDRAC) intermittently answer the
central GET /redfish/v1/Systems request with a transient error such as
HTTP 503 or 404, while all other endpoints respond normally. Because that
request is the parent of every system-scoped section, a single such failure
made the agent emit a successful but incomplete result, and the affected
services (CPUs, memory, storage, drives, volumes and network interfaces)
briefly vanished from the monitoring ("Item not found in monitoring data").
The special agent now retries the system-data request on a transient failure before giving up. By default it retries 3 times with a 2-second delay between attempts. Both values are configurable per rule under Redfish Compatible Management Controller → Retry fetching system data; set the number of retries to 0 to disable retrying.
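The retry behaviour described above can be sketched roughly as follows. This is an illustration, not the agent's actual code: `fetch_with_retry`, `TransientError` and the `fetch` callable are hypothetical names, and the sketch assumes that "3 retries" means up to three additional attempts after the first failed one.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a transient failure such as HTTP 503 or 404."""


def fetch_with_retry(
    fetch: Callable[[], T],
    retries: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on TransientError, retry up to `retries` more times.

    With retries=0 the first failure is re-raised immediately,
    which corresponds to disabling the retry.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except TransientError:
            if attempt >= retries:
                raise  # retries exhausted: the caller decides what happens next
            attempt += 1
            sleep(delay)  # wait before the next attempt
```

The `sleep` parameter exists only so the delay can be replaced in tests. In this sketch a request that fails twice and then succeeds results in three calls to `fetch` and two delays.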
If the request still cannot be fetched after the configured retries, the agent
now aborts the run instead of publishing an incomplete dataset. As a result the
monitoring keeps the previously collected data and the system services no
longer disappear during a short controller hiccup; the Check_MK service of the
affected host turns CRITICAL for that interval instead.
Per-section resilience is unchanged: a failure of an individual section (e.g. a single drive) still only affects that section.
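The difference between the two failure modes can be illustrated with a minimal sketch. All names here (`collect`, `FetchError`, the fetcher callables and the section names) are hypothetical and do not reflect the agent's real structure. The point is only that a failure of the parent request propagates and nothing is published, while a failure inside one section is contained.

```python
from typing import Callable, Dict, List


class FetchError(Exception):
    """Hypothetical error raised when an endpoint cannot be fetched."""


def collect(
    fetch_systems: Callable[[], List[str]],
    section_fetchers: Dict[str, Callable[[str], str]],
) -> Dict[str, List[str]]:
    # Parent request: a FetchError here is deliberately not caught,
    # so the whole run aborts and no partial dataset is returned.
    systems = fetch_systems()

    result: Dict[str, List[str]] = {}
    for name, fetch_section in section_fetchers.items():
        rows: List[str] = []
        for system in systems:
            try:
                rows.append(fetch_section(system))
            except FetchError:
                # Per-section resilience: only this entry is skipped.
                continue
        result[name] = rows
    return result
```

In a real agent the uncaught error would end the run with a failure, so the monitoring keeps the previously collected data rather than receiving an incomplete result.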
No user action is required. To tune the behaviour, adjust the new rule option.