Werk #18748: liveproxyd: Do not abort client response on transient send-buffer pressure
| Component | Livestatus |
| Title | liveproxyd: Do not abort client response on transient send-buffer pressure |
| Date | May 8, 2026 |
| Level | Trivial Change |
| Class | Bug Fix |
| Compatibility | Compatible - no manual interaction needed |
| Checkmk versions & editions | |

The liveproxyd response-sending loop polled the local client socket for write
readiness with a hardcoded two-second select() and treated an empty result as
fatal: it raised ClientResponseException("Client socket is unwilling to accept
response data") and logged "Unhandled exception" to var/log/liveproxyd.log.
The code, its surrounding comment, and the original commit show that the two
seconds were intended as a wake-up cadence for re-checking the channel-shutdown
signal, not as a deadline. The actual response deadline is the per-query
timeout (default 120s, configurable via liveproxyd_default_connection_params),
which is already enforced separately in the same loop.

The exception was visible in setups where the central site queries many
distributed sites in a single REST API call (for example host/collections/all
with many columns= parameters). The Apache CGI process consumes site responses
serially, so while it is busy reading or parsing one site's response, the
kernel send buffers for the other sites can fill up on the local unix socket.
The previous loop treated that transient back-pressure as a fatal error, even
though the request typically had well over 100 seconds of its budget
remaining. On the affected distributed sites this manifested as the error
"SSL routines::unexpected eof while reading".
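
The back-pressure described above can be reproduced in isolation. The following
is a minimal sketch, not liveproxyd code: it assumes Linux, uses
socket.socketpair() as a stand-in for the local unix socket, and the helper
names (fill_send_buffer, is_writable, drain, demo) are hypothetical.

```python
import select
import socket

CHUNK = 65536  # arbitrary chunk size chosen for this illustration


def fill_send_buffer(sock: socket.socket) -> int:
    """Write to a non-blocking socket until the kernel refuses more data."""
    sock.setblocking(False)
    total = 0
    while True:
        try:
            total += sock.send(b"x" * CHUNK)
        except BlockingIOError:
            # Send buffer is full because the peer is not reading.
            return total


def is_writable(sock: socket.socket, timeout: float = 0.1) -> bool:
    """Return True if select() reports the socket as ready for writing."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return bool(writable)


def drain(sock: socket.socket, expected: int) -> int:
    """Read up to `expected` bytes that are already queued on the socket."""
    sock.setblocking(False)
    total = 0
    while total < expected:
        try:
            data = sock.recv(CHUNK)
        except BlockingIOError:
            break
        if not data:
            break
        total += len(data)
    return total


def demo() -> tuple[int, bool, int, bool]:
    writer, reader = socket.socketpair()
    try:
        queued = fill_send_buffer(writer)    # the reader is "busy" elsewhere
        blocked = not is_writable(writer)    # select() comes back empty
        drained = drain(reader, queued)      # the reader catches up
        recovered = is_writable(writer)      # pressure was only transient
        return queued, blocked, drained, recovered
    finally:
        writer.close()
        reader.close()
```

Calling demo() shows that an empty select() result only means the peer has not
read yet. The socket becomes writable again as soon as the reader drains it,
which is why treating the empty result as fatal was too strict.
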

The send loop now uses the two-second select() purely as a polling cadence
and keeps retrying until the socket becomes writable, the channel is asked to
terminate, or the configured query timeout is exceeded. As a related cleanup,
when the peer closes the local socket mid-send, the resulting BrokenPipeError
from send() is caught and reported as "Client closed connection while sending
response" instead of propagating as an unhandled exception. No configuration
change is required.
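
The behavior described above can be sketched roughly as follows. This is an
illustration under stated assumptions, not the actual liveproxyd
implementation. The function name send_response and its parameters (deadline,
terminate, poll_interval) are hypothetical, the exception class is a local
stand-in for the one named in this Werk, and the sketch switches the socket to
non-blocking mode so that send() returns after a partial write.

```python
import select
import socket
import threading
import time

POLL_INTERVAL = 2.0  # wake-up cadence named in the Werk; not a deadline


class ClientResponseException(Exception):
    """Stand-in for the exception named in the Werk (definition assumed)."""


def send_response(
    sock: socket.socket,
    data: bytes,
    deadline: float,  # time.monotonic() timestamp derived from the query timeout
    terminate: threading.Event,  # hypothetical channel-shutdown signal
    poll_interval: float = POLL_INTERVAL,
) -> None:
    """Send all of `data`, tolerating transient send-buffer pressure."""
    sock.setblocking(False)  # assumption: lets send() return partial writes
    view = memoryview(data)
    while view:
        if terminate.is_set():
            raise ClientResponseException("Channel is shutting down")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise ClientResponseException("Query timeout exceeded while sending response")
        # Wait at most one polling interval, then re-check shutdown and deadline.
        _, writable, _ = select.select([], [sock], [], min(poll_interval, remaining))
        if not writable:
            continue  # transient back-pressure: retry instead of failing
        try:
            sent = sock.send(view)
        except BlockingIOError:
            continue  # buffer filled between select() and send()
        except BrokenPipeError as exc:
            raise ClientResponseException(
                "Client closed connection while sending response"
            ) from exc
        view = view[sent:]
```

In this sketch only the shutdown signal, the query deadline, and a closed peer
end the loop. An empty select() result simply triggers another iteration.
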