Werk #8350: Real-time checks: Introducing checking in one second resolution
Component | The Checkmk Micro Core
Title | Real-time checks: Introducing checking in one second resolution
Date | Dec 16, 2015
Level | Prominent Change
Class | New Feature
Compatibility | Compatible - no manual interaction needed
This release introduces a new feature named real-time checks. With real-time checks it is possible to monitor selected values at much shorter intervals than the normal interval of 60 seconds.
This feature has mainly been developed to provide detailed graphs for values that change often, such as memory usage or CPU utilization. However, not only the performance data is updated at this interval. The complete service state is updated as well, which may also result in faster notifications.
Real-time checks work as follows: the core listens on the network for incoming real-time check results, which are UDP packets that the agents send at an interval of one second. This needs to be enabled using the configuration option Global Settings > Enable handling of Real-Time Checks. There you configure the UDP port to listen on (6559 by default) and the secret that is used to decrypt the real-time check results. This secret must be identical on the Check_MK server and on all agents that send real-time check results.
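When setting this up, it can help to confirm that UDP datagrams from an agent actually reach the monitoring host. The following is only a diagnostic sketch, not part of Check_MK: the helper names `open_listener` and `receive_datagrams` are made up for illustration, and the code does not decrypt or interpret the payload. It can only bind the port while nothing else is listening on it, so it assumes the core's real-time check handling is not active on that host at the same time.

```python
import socket


def open_listener(port=6559, host="0.0.0.0", timeout=5.0):
    """Bind a UDP socket. 6559 is the default port named in this Werk."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    sock.bind((host, port))
    return sock


def receive_datagrams(sock, count):
    """Collect (sender_ip, payload_length) for up to `count` datagrams.

    Returns early with fewer entries if the socket timeout expires.
    The payload is not decoded here; only its size is recorded.
    """
    seen = []
    try:
        for _ in range(count):
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
            seen.append((sender_ip, len(payload)))
    except socket.timeout:
        pass  # fewer datagrams than requested arrived in time
    return seen
```

Calling `receive_datagrams(open_listener(), 5)` shortly after the server has queried an agent should, under these assumptions, return a handful of entries carrying the agent's address. An empty list suggests that the packets are being dropped on the way, for example by a firewall.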
The agents need to be configured to send real-time check results. This is currently possible for the Linux and Windows agents. On Linux, you need to create a file /etc/check_mk/real_time_checks.cfg with the following contents:
/etc/check_mk/real_time_checks.cfg
RTC_TIMEOUT=90
RTC_PORT=6559
RTC_SECRET='hallo123'
RTC_SECTIONS=""
RTC_SECTIONS+="mem "
RTC_SECTIONS+="cpu "
It is a good idea to restrict the permissions of this file, because it contains the real-time check secret that the Check_MK agent and the server share to encrypt the transferred data. For example: chmod 640 /etc/check_mk/real_time_checks.cfg.
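The permission advice above can also be checked from a script. The following sketch uses only the Python standard library. The function names are hypothetical and the mode 640 is simply the example value from the text.

```python
import os
import stat


def others_can_read(path):
    """Return True if users outside the owner and group may read the file."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)


def restrict_to_owner_and_group(path):
    """Apply the equivalent of `chmod 640` to the file."""
    os.chmod(path, 0o640)
```

A deployment script could call `others_can_read("/etc/check_mk/real_time_checks.cfg")` and apply `restrict_to_owner_and_group` only when it returns True.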
On Windows, you need to add the following to the [global] section of your check_mk.ini and restart the Check_MK service:
check_mk.ini
realtime_port = 6559
realtime_sections = mem winperf
realtime_timeout = 90
passphrase = hallo123
The agent works as usual, waiting for connections from the Check_MK server. When a Check_MK server contacts the agent, the agent replies with its regular response. If real-time checks are enabled, the agent then sends one UDP packet per second for each enabled section to the host address that queried the Check_MK agent, which is normally the Check_MK server's address.
The data that can be processed as real-time checks is limited, so only certain sections can be sent as real-time checks. Currently you can enable only the mem and cpu sections on Linux, and the mem and winperf sections on Windows. This might be extended in the future.
To get detailed graphs, you now need to configure your RRD databases so that they can store this detailed information. This can be done via the ruleset Host & Service Parameters > Monitoring Configuration > Configuration of RRD databases of services.
Create a new rule and first make sure that it applies only to services that receive real-time check results, because the RRDs of these services need more disk space. You should therefore select only the CPU and memory services of hosts that send real-time check results. Then configure the rule with a precision of 1 second for a duration of your choice.
As an example, to get:
- 1 second resolution for 4 hours
- 1 minute resolution for 2 days
- 5 minute resolution for 10 days
- 30 minute resolution for 90 days
- 6 hour resolution for 4 years
You need to configure these numbers:
- Step (precision): 1 sec.
- RRA configuration:
  - 50.0%, 1, 14400
  - 50.0%, 60, 2880
  - 50.0%, 300, 2880
  - 50.0%, 1800, 4320
  - 50.0%, 21600, 5840
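The RRA numbers above follow from simple arithmetic, which the following sketch reproduces so that the values can be adapted to other retention periods. It assumes that the second value of each RRA line is the number of steps consolidated into one row and the third is the number of rows, and that a year has 365 days. The function name `rra_rows` is chosen for illustration only.

```python
def rra_rows(resolution_seconds, duration_seconds, step_seconds=1):
    """Return (steps_per_row, rows) for one RRA.

    steps_per_row: how many base steps are consolidated into one row.
    rows: how many rows are needed to cover the requested duration.
    """
    steps_per_row = resolution_seconds // step_seconds
    rows = duration_seconds // resolution_seconds
    return steps_per_row, rows


HOUR = 3600
DAY = 24 * HOUR
YEAR = 365 * DAY  # assumption: leap days are ignored

# (resolution in seconds, retention in seconds) for the example above
EXAMPLE = [
    (1, 4 * HOUR),
    (60, 2 * DAY),
    (300, 10 * DAY),
    (1800, 90 * DAY),
    (21600, 4 * YEAR),
]

for resolution, duration in EXAMPLE:
    steps, rows = rra_rows(resolution, duration)
    print(f"50.0%, {steps}, {rows}")
```

With a step of 1 second this prints the five RRA lines listed above. Keep in mind that the row counts add up, so longer high-resolution retention directly increases the size of each RRD file.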
After you have configured this, you need to run cmk --convert-rrds -v to convert the existing RRDs.
Once the conversion has finished and the real-time checks are processed correctly, you should see the service state, output and graphs, for example of the "CPU utilisation" service, updating at an interval of one second.