Werk #18747: agent_elasticsearch: Handle HTTP errors and produce deterministic section output
| Component | Checks & agents |
| Title | agent_elasticsearch: Handle HTTP errors and produce deterministic section output |
| Date | May 7, 2026 |
| Level | Trivial Change |
| Class | Bug Fix |
| Compatibility | Compatible - no manual interaction needed |
| Checkmk versions & editions | |
The Elasticsearch special agent could intermittently produce partial or empty output when one of the queried sections returned a non-2xx HTTP response. The agent passed the response body to the JSON decoder without checking the status code, so an error response was parsed as if it were valid data. The resulting validation error aborted all remaining sections and was then swallowed by the outer exception handler.
Three issues are fixed:
- The nodes endpoint is narrowed from /_nodes/_all/stats to /_nodes/stats/process. The agent only ever consumed the process sub-tree of the response, so requesting the rest needlessly increased the payload and exposed the agent to upstream serialization bugs in unused stats categories. This was the trigger seen on AWS OpenSearch, which occasionally returns HTTP 400 from /_nodes/_all/stats because of negative byte counts (integer overflow) in stats categories the agent does not read. If a future check needs JVM, filesystem or other categories, the URL must be broadened again.
- The HTTP status code is now checked before the response is decoded as JSON. Non-200 responses are logged to stderr and the affected section is skipped, letting the remaining sections run.
- The list of sections to query is now iterated in a fixed order (cluster_health, nodes, stats) instead of in the iteration order of a set(). A failure in one section can no longer retroactively suppress the output of an earlier successful section.
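
The behaviour described in the three points above can be illustrated with a minimal Python sketch. It is not the agent's actual code: the function name collect_sections, the injected fetch callable, and the endpoint paths for cluster_health and stats are assumptions made for illustration. Only the narrowed nodes path, the section order, the status check before JSON decoding, and the logging to stderr are taken from the description above.

```python
import json
import sys
from typing import Any, Callable, List, Tuple

# Sections in a fixed order. A tuple is used instead of a set() so that the
# iteration order is the same on every run.
# The nodes path is the narrowed endpoint named in this Werk; the other two
# paths are placeholders and may differ from what the agent really queries.
SECTIONS: Tuple[Tuple[str, str], ...] = (
    ("cluster_health", "/_cluster/health"),
    ("nodes", "/_nodes/stats/process"),
    ("stats", "/_stats"),
)

# Hypothetical transport: takes a path, returns (HTTP status code, body text).
Fetch = Callable[[str], Tuple[int, str]]


def collect_sections(fetch: Fetch) -> List[Tuple[str, Any]]:
    """Query every section in order and skip the ones that fail."""
    results: List[Tuple[str, Any]] = []
    for name, path in SECTIONS:
        status, body = fetch(path)
        # Check the status code before the body reaches the JSON decoder.
        if status != 200:
            print(f"{name}: HTTP {status} from {path}, section skipped",
                  file=sys.stderr)
            continue
        try:
            data = json.loads(body)
        except json.JSONDecodeError as exc:
            print(f"{name}: invalid JSON from {path}: {exc}", file=sys.stderr)
            continue
        results.append((name, data))
    return results


def fake_fetch(path: str) -> Tuple[int, str]:
    """Stand-in for the HTTP call: the nodes endpoint answers with HTTP 400."""
    if path == "/_nodes/stats/process":
        return 400, '{"error": "bad request"}'
    return 200, '{"status": "green"}'


collected = collect_sections(fake_fetch)
print([name for name, _ in collected])  # prints ['cluster_health', 'stats']
```

In this sketch the failing nodes section only costs its own output: cluster_health has already been collected and stats still runs afterwards.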