An important advantage with Checkmk compared to other monitoring systems is the large number of well-maintained check plug-ins supplied as standard. In order for these plug-ins to have a uniformly high quality there are standardised criteria that each plug-in must meet. In this article we will introduce these criteria.
An important note regarding the criteria: please don't simply assume that all plug-ins supplied with Checkmk conform to all current standards. Please avoid copy & paste. It is more advisable to orient your work according to the information in this article.
For check plug-ins that are official components of Checkmk, or are planned to become such, higher quality is demanded than for those written “for your own use”. This expectation applies to their ‘outer’ quality (as seen by the user) as well as to their internal quality (the readability of the code, etc.).
2.2. The basics of a check plug-in
Every check plug-in must as a minimum include the following components:
- The check plug-in itself
- A Manpage
- Plug-ins with check parameters require a WATO-Definition for the applicable Rule Set
- Metrics definitions for graphs and the Perf-O-Meter if the check produces metrics data
- A definition for the Agent Bakery if an agent plug-in is present
- Several complete and varied examples of the agent's output, or of SNMP walks for SNMP-based checks
3. Naming conventions
3.1. Plug-in names
Choosing a name for a plug-in is especially critical since this name cannot be altered at a later date.
- A plug-in's name must be short, sufficiently specific and understandable. Example: firewall_status is only a good name if the plug-in functions for all or at least many firewalls.
- A name is composed of lower case letters and numerals. An underscore is permitted as a separator.
- The words status or state are unnecessary in a name, since of course every plug-in monitors a status. The same applies for the superfluous word current. So, rather than foobar_current_temp_status simply use just foobar_temp.
- Checks that apply to a physical thing (e.g. fan, power supply), should have a name in the singular – for example, casa_fan, oracle_tablespace. Checks in which each item refers to a number or multiples should be named using a plural – for example, user_logins, printer_pages.
- Product-specific checks should be prefixed with the product name – e.g., oracle_tablespace.
- Manufacturer-specific check plug-ins that do NOT apply to a SPECIFIC product should be prefixed with the manufacturer's abbreviation – e.g., fsc_ for Fujitsu Siemens Computers.
- SNMP-based check plug-ins that use a common component of the MIB which may well be supported by more than one manufacturer should be named after the MIB, rather than after a manufacturer – e.g., the hr_*-check plug-ins.
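The formal part of these naming rules can be checked mechanically. The following is only a minimal sketch: the function plugin_name_issues and its messages are hypothetical and not part of Checkmk, and the pattern additionally assumes that underscores appear singly and never at the start or end of a name.

```python
import re

# Words the guidelines call superfluous in plug-in names.
SUPERFLUOUS_WORDS = {"status", "state", "current"}

# Lower case letters and numerals, with single underscores as separators
# (assumption: no leading, trailing or doubled underscores).
NAME_PATTERN = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*")


def plugin_name_issues(name):
    """Return a list of human-readable problems found in a plug-in name."""
    issues = []
    if not NAME_PATTERN.fullmatch(name):
        issues.append("use only lower case letters, numerals and underscores")
    for word in name.split("_"):
        if word in SUPERFLUOUS_WORDS:
            issues.append("drop the superfluous word '%s'" % word)
    return issues
```

A name such as foobar_temp passes without issues, while foobar_current_temp_status is reported twice, once for current and once for status. Whether a name is sufficiently specific and understandable still has to be judged by a person.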
3.2. Service names
- Services from different check plug-ins that monitor the same kind of thing should be named in the same way – thus, for example, always use interface if the service applies to a network interface. This makes the creation of rules easier for the user.
- Naming schemes have already been established for particular types of checks. These must be used.
- Write ‘fan’ – not ‘FAN’ – since this is not an abbreviation.
- Abbreviations are always written in capitals, e.g. "CPU".
- As a general rule the first letter is written in upper case, the rest in lower case, e.g. "Bonding interface %s". Exceptions are proper names or items.
3.3. Names for metrics
- If a meaningful definition already exists for a metric, always use it.
- Otherwise similar rules as used for the naming of check plug-ins apply (product-specific, manufacturer-specific, etc.)
3.4. WATO-rule group names
- The same convention applies as with metrics.
4. Constructing check plug-ins
4.1. General structure
The actual Python file under share/check_mk/checks/ should have the following structure (in this order):
- A file header with a GPL-Notice
- The name and email address of the original author if the plug-in has not been developed by the Checkmk project.
- A short sample of the agent's output
- Default values for the check parameter (factory_settings)
- Help functions, if available
- The Parse function, if available
- The Discovery function
- The Check function
- The check_info-declaration
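The order above can be sketched as a small check file. This is an illustration only: the plug-in foobar_temp, its agent section and its levels are hypothetical, and check_info and factory_settings are defined here merely as stand-ins so that the example runs on its own. In a real check file Checkmk provides both, and the GPL header comes first.

```python
# Stand-ins for the registries that Checkmk provides to check files.
# In a real check file these already exist and must not be redefined.
check_info = {}
factory_settings = {}

# (The file header with the GPL notice would be placed here.)

# Sample agent output (hypothetical):
# <<<foobar_temp>>>
# sensor1 42
# sensor2 75

# Default values for the check parameters
factory_settings["foobar_temp_default_levels"] = {
    "levels": (70, 80),
}


# Help function
def foobar_temp_state(value, warn, crit):
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


# Discovery function
def inventory_foobar_temp(info):
    for sensor_id, _reading in info:
        yield sensor_id, {}


# Check function: a single result, therefore return instead of yield
def check_foobar_temp(item, params, info):
    warn, crit = params["levels"]
    for sensor_id, reading in info:
        if sensor_id == item:
            value = int(reading)
            state = foobar_temp_state(value, warn, crit)
            return state, "Temperature: %d °C" % value, [("temp", value, warn, crit)]


# The check_info declaration
check_info["foobar_temp"] = {
    "inventory_function": inventory_foobar_temp,
    "check_function": check_foobar_temp,
    "service_description": "Temperature %s",
    "default_levels_variable": "foobar_temp_default_levels",
    "group": "temperature",
    "has_perfdata": True,
}
```

If the item is not present in the agent data, the check function falls through the loop and returns None.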
4.2. Coding guidelines
- If the plug-in has not been developed by the Checkmk-Team, the author's name and email address should be coded directly after the GPL-header.
- Avoid long lines of code – the maximum permitted length is 100 characters.
- In each case the indentation is four blank characters – do not use tabs.
- Follow the Python standard PEP 8.
Sample agent outputs
Including a sample of an agent's output greatly simplifies the reading of the code. When doing so it is important to include various possible outputs in the sample. Make the sample no longer than necessary. With SNMP-based checks provide an SNMP-Walk:
Example excerpt from SNMP data:
.220.127.116.11.18.104.22.168.22.214.171.124.1.0 255
.126.96.36.199.188.8.131.52.184.108.40.206.220.127.116.11 1
.18.104.22.168.22.214.171.124.126.96.36.199.188.8.131.52 "Good"
.184.108.40.206.220.127.116.11.18.104.22.168.22.214.171.124 "No critical or warning events"
.126.96.36.199.188.8.131.52.184.108.40.206.220.127.116.11 "No timestamp"
If, for example, due to differing firmware standards in the target devices differing output formats are produced, then an example noting the version should be provided for each. A good example of this case can be found in the multipath check plug-in.
When defining the snmp_info the readable path to the OID should be given in the comments. Example:
'snmp_info' : [(".18.104.22.168.22.214.171.124.1.1.1", [
    OID_END,
    "2", # ENTITY-MIB::entPhysicalDescription
    "5", # ENTITY-MIB::entPhysicalClass
    "7", # ENTITY-MIB::entPhysicalName
]),
Avoid complex expressions with lambda. Permitted is lambda in the lambda oid: ... scan function, and when you wish to invoke existing functions with only an altered argument – for example:
"inventory_function" : lambda info: inventory_foobar_generic(info, "temperature")
Iterating through SNMP-agent data
With checks that parse SNMP-data, an index like this should not be used...
for line in info:
    if line[1] != '' and line[0] ...
It is better to unpack each line as meaningful variables:
for sensor_id, sensor_state, foo, bar in info:
    if sensor_state != '1' and sensor_id ...
Always use parse functions whenever parsing an agent's output is not trivial. The parse function's argument should always be named info, and in the discovery and check functions the argument should be named parsed instead of info. In this way it will be clear to the reader that this result is from a parse function.
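A parse function following this convention might look like the sketch below. The section layout (sensor ID, state, reading) and the meaning of state '1' as OK are assumptions made for the example, and all function names are hypothetical.

```python
def parse_foobar_sensors(info):
    """Turn the raw agent lines into a dictionary keyed by sensor ID."""
    parsed = {}
    for sensor_id, sensor_state, reading in info:
        parsed[sensor_id] = {
            "state": sensor_state,
            "reading": float(reading),
        }
    return parsed


def inventory_foobar_sensors(parsed):
    # One service per sensor, without parameters
    for sensor_id in parsed:
        yield sensor_id, None


def check_foobar_sensors(item, params, parsed):
    if item not in parsed:
        return None  # Checkmk then reports the missing item itself
    sensor = parsed[item]
    if sensor["state"] != "1":
        return 2, "State: failed"
    return 0, "Reading: %.1f V" % sensor["reading"]
```

Because the discovery and check functions receive parsed, they can address a sensor by name and no longer need to know the column order of the agent output.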
Checks with multiple partial results
A check that produces multiple partial results – for example, current allocations and growth – must return these with yield. Checks that produce only a single result must use return.
if "abs_levels" in params:
    warn, crit = params["abs_levels"]
    if value >= crit:
        yield 2, "...."
    elif value >= warn:
        yield 1, "...."
    else:
        yield 0, "..."

if "perc_levels" in params:
    warn, crit = params["perc_levels"]
    if percentage >= crit:
        yield 2, "...."
    elif percentage >= warn:
        yield 1, "...."
    else:
        yield 0, "..."
The (!) and (!!) markers are obsolete and may no longer be used. These should be replaced by yield.
Keys in check_info[...]
In your entry in check_info, only store keys that are actually used. The only required entries are ‘service_description’ and ‘check_function’. Only insert ‘has_perfdata’ and other keys with boolean values if their value is True.
4.3. Agent plug-ins
If your check plug-in requires an agent plug-in, then be aware of the following rules:
- Store the plug-in in share/check_mk/agents/plugins for Unix-type systems, and set the execution rights to 755.
- In Windows the directory is called share/check_mk/agents/windows/plugins.
- Shell and Python scripts should have no file name extension (omit .sh and .py).
- Use #!/bin/sh in the first line of shell scripts. Only use #!/bin/bash if Bash features are required.
- Use the standard Checkmk-file heading with the GPL-notice.
- Your plug-in must not damage the target system, especially if the plug-in is not actually supported by the system.
- Remember to note the plug-in in the check's manpage.
- If the component that the plug-in is to monitor doesn't actually exist on a system, the plug-in must not output a section head.
- If the plug-in requires a configuration file this should (in Linux) be searched for in the $MK_CONFDIR directory, and the file must have the same name as the plug-in – apart from the .cfg extension, and without a possible mk_ prefix. The procedure is similar for Windows – the directory in Windows is %MK_CONFDIR%.
- Do not code plug-ins for Windows in PowerShell. It is not portable, and is in any case very resource-hungry. Use VBS.
- Do not code plug-ins in Java.
- Do not use import in your check files. All permitted Python modules have already been imported.
- Do not use datetime for parsing and calculating time specifications – use time. This can perform all needed tasks. Really!
- Your functions must in no way modify the arguments they receive. This especially applies to params and info.
- Should you really want to work with regular expressions (they are slow!), invoke these with the regex() function. Do not use re directly.
- Naturally it is not permitted to use print, or otherwise route outputs to stdout, or communicate with the outside world in any way!
- The SNMP scan function is not allowed to retrieve OIDs other than .1.3.6.1.2.1.1.1.0 (sysDescr) and .1.3.6.1.2.1.1.2.0 (sysObjectID). Exception: the scan function has first ensured, by checking one of these two OIDs, that the further OIDs will be retrieved only from a strictly limited number of devices.
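The rule about not modifying function arguments is easy to violate by accident, for example by filling defaults into params or by sorting info in place. A minimal sketch of how to avoid this follows. The helper names and the default levels are hypothetical.

```python
def effective_params(params):
    """Return params with a default applied, leaving the caller's dict untouched."""
    effective = dict(params)  # shallow copy; enough here since values are only added
    effective.setdefault("levels", (80.0, 90.0))
    return effective


def sorted_lines(info):
    """Return the agent lines in sorted order without reordering info itself."""
    return sorted(info)  # sorted() builds a new list; info.sort() would modify info
```

Working on a copy matters because the same params and info objects may be handed to other services or subchecks after your function has returned.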
5. Behaviour of check plug-ins
Your check plug-in not only may, but must always assume that an agent's output is syntactically valid. Under no circumstances is the plug-in permitted to attempt to handle unknown error situations in the output itself!
Why is this so? Checkmk has a very refined function for automatically handling such errors. For the user it can generate comprehensive crash reports, and it also sets the status of the plug-in to UNKNOWN. This is much more helpful than if the check, for example, simply produces an unknown SNMP code 17.
5.2. saveint() and savefloat()
The saveint() and savefloat() functions convert a string into an int or float and produce a 0 if the string cannot be converted (e.g. it is an empty string).
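The described behaviour can be reproduced with the following stand-ins. They are written only for illustration and are not Checkmk's own code. In a check file the real functions are already available.

```python
def saveint(value):
    """Stand-in: convert value to int, or return 0 if that is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def savefloat(value):
    """Stand-in: convert value to float, or return 0.0 if that is not possible."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0
```

Note that after the conversion a genuine value of 0 can no longer be distinguished from a string that could not be converted.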
5.3. Item not found
A check that doesn't find an item being monitored should simply produce a None, and not generate its own error message. In such a case Checkmk will produce a standardised, consistent error message, and set the service to UNKNOWN.
Many check plug-ins have parameters which define thresholds for specific metrics, and thus determine when the check assumes a WARN or CRIT status. Please be aware of the following rules that ensure Checkmk reacts consistently.
- The thresholds for WARN and CRIT should always be verified with >= and <=. Example: a plug-in monitors the length of a mail queue. The critical upper limit is 100. This means that if the actual value is '100' it is already critical!
- If there are ONLY upper, or ONLY lower thresholds (the commonest cases), then the entry fields in WATO should be labelled Warning at ______ and Critical at ______.
- If there are upper AND lower thresholds, the coding should be as follows: Warning at or above ___, Critical at or above ___, Warning at or below ___ and Critical at or below ___.
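The comparison rule can be sketched as follows. The function level_state is hypothetical and only illustrates the use of >= for upper and <= for lower thresholds.

```python
def level_state(value, upper=None, lower=None):
    """Return 0 (OK), 1 (WARN) or 2 (CRIT) for a measured value.

    upper and lower are optional (warn, crit) pairs.
    """
    state = 0
    if upper is not None:
        warn, crit = upper
        if value >= crit:
            state = max(state, 2)
        elif value >= warn:
            state = max(state, 1)
    if lower is not None:
        warn, crit = lower
        if value <= crit:
            state = max(state, 2)
        elif value <= warn:
            state = max(state, 1)
    return state
```

With upper=(80, 100), a mail queue length of exactly 100 already yields CRIT, and a length of exactly 80 already yields WARN.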
5.5. Check plug-in outputs
Every check produces one line of text – the plug-in output. To achieve a consistent behaviour for all plug-ins, the following rules apply:
- For showing measured values, exactly one blank character should separate the value and the unit (e.g. 17.4 V). The only exception to this rule is with %, where there is no blank: 89.5%.
- When listing measured values, the value's name with an initial capital is followed by a colon. Example: Voltage: 24.5 V, Phase: negative, Flux-Compensator: operational
- Do not show internal keys, codewords, SNMP internals or other rubbish that is of no use to the user in plug-in outputs. Use meaningful human-readable terms. Use terms that the user normally expects! Example: Use route monitor has failed rather than routeMonitorFail.
- If the check item has an additional specification, code this in square brackets at the beginning of the output (e.g. Interface 2 - [eth0] ...)
- In listings, items are separated by commas, and following items have initial capitals: Swap used: ..., Total virtual memory used: ...
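The formatting rules for values and listings can be captured in two small helpers. Both functions are hypothetical and only sketch one way of keeping the output consistent.

```python
def format_value(value, unit):
    """Format a number with its unit: one blank before the unit, none before %."""
    separator = "" if unit == "%" else " "
    return "%s%s%s" % (value, separator, unit)


def format_output(details):
    """Join (name, text) pairs into a single plug-in output line."""
    parts = []
    for name, text in details:
        # Capitalise only the first character and leave the rest unchanged,
        # so that abbreviations such as CPU keep their spelling.
        parts.append("%s: %s" % (name[:1].upper() + name[1:], text))
    return ", ".join(parts)
```

For example, format_output([("voltage", format_value(24.5, "V")), ("phase", "negative")]) produces Voltage: 24.5 V, Phase: negative.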
5.6. Default thresholds
Every plug-in that works with thresholds should have meaningful default threshold values defined for it. The following rules apply:
- The default thresholds used in the check should also be defined 1:1 as default parameters in the applicable WATO-rule.
- The default thresholds should be defined in factory_settings (if the check has a dictionary as a parameter).
- The default thresholds should be selected on a technically sound basis. Is there a manufacturer's specification? Are there best practices?
- It is essential that the source of the thresholds be documented in the check.
5.7. Nagios vs. CMC
6.1. Formats of metrics
- The check plug-in always returns metric data as int or float. Strings are not allowed.
- If you wish to omit a field from the six-tuple of a metric value, use None in its position. Example: [("taple_util", utilization, None, None, 0, size)]
- If you don't require the entry at the end, simply shorten the tuple. Do not use a None at the end.
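Both rules, None for omitted fields in the middle and a shortened tuple instead of trailing None entries, can be combined in one helper. The function metric is hypothetical and not part of Checkmk.

```python
def metric(name, value, warn=None, crit=None, minimum=None, maximum=None):
    """Build a metric tuple, dropping unused fields at the end."""
    fields = [name, value, warn, crit, minimum, maximum]
    # Strip trailing None entries, but never the name or the value.
    while len(fields) > 2 and fields[-1] is None:
        fields.pop()
    return tuple(fields)
```

Calling metric("fs_used", 42.0, minimum=0, maximum=100) keeps the two None placeholders for the missing levels, whereas metric("temp", 42, 70, 80) is simply shortened to four fields.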
6.2. Naming the metrics
- Metric names consist of lower case letters and underscores. Numerals are permitted, but not leading.
- Metric names should be, as with check plug-ins, short and specific. Metrics that will be used by multiple plug-ins should have generic names.
- Avoid using the pointless filler word current. The measured value is always the current one.
- The metric should be named after the ‘thing’, not after the unit. Thus, for example, current rather than ampere, or size rather than bytes.
- Important: always use the canonical unit. Really! Checkmk scales the data itself as appropriate. Examples:
|Network throughput||Octets per second (not bits/sec!)|
|Percentage value||A value from 0 to 100 (not 0.0 to 1.0)|
|Events per time period||1 per second|
|Electrical power||Watts (not mW)|
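If a device reports its data in other units, convert the values in the check before returning them as metrics. The helper names below are hypothetical. The conversions themselves follow directly from the table.

```python
def bits_to_octets_per_second(bits_per_second):
    """Network throughput: 8 bits make one octet."""
    return bits_per_second / 8.0


def ratio_to_percent(ratio):
    """Percentage value: map 0.0 to 1.0 onto 0 to 100."""
    return ratio * 100.0


def milliwatts_to_watts(milliwatts):
    """Electrical power: 1000 mW make one watt."""
    return milliwatts / 1000.0
```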
6.3. Flags for metric data
- Only set ‘has_perfdata’ in check_info to True if the check actually outputs metric data (or can output it).
6.4. Definitions for graphs and the Perf-O-Meter
The definitions for graphs should be like the definitions in web/plugins/metrics/check_mk.py. Do not create definitions for PNP-graphs. In the Raw Edition as well these will be generated on the basis of the metric definitions in Checkmk itself.
7.1. Check group names
Check plug-ins with parameters require a compulsory WATO-rule definition. The connection between a plug-in and a rule is made through the check group (the entry ‘group’ in check_info). All checks that are configured with the same rule set are consolidated via the group.
If your plug-in should sensibly be configured with an existing rule set, then also use an existing group.
If your plug-in is so specific that it in any case requires its own group, then create a separate group for it, whose name should reference the plug-in.
7.2. Default values for ValueSpecs
When defining your parameter definitions (ValueSpecs) use the exact same default values as the defaults actually used in the checks (if possible). Example: if without a rule the check assumes the threshold (5, 10) for WARN and CRIT, then the ValueSpec should be so defined that 5 and 10 will be automatically offered as thresholds.
7.3. Choosing ValueSpecs
For some types of data there are specialised ValueSpecs. An example is Age for a certain number of seconds. This must be used wherever it is appropriate. Do not, for example, use Integer in such a case.
8. Include files
For a number of types of checks there are already prepared implementations in include files, that not only can be used, but SHOULD be used:
|temperature.include||Monitoring of temperatures|
|elphase.include||Electrical AC phases (e.g. in a UPS)|
|df.include||File system levels|
|mem.include||Monitoring of RAM (Main storage)|
|ps.include||Operating system processes|
Every check plug-in must have a Manpage. If multiple plug-ins (subchecks) have been programmed in a check file, each one must naturally have its own Manpage.
The Manpage has been conceived for the user! Write helpful information there. It's not just a matter of you documenting what you have programmed, rather it provides important, useful information to the user.
The Manpage must be:
9.1. Catalogue entries
Using the catalog: header you can specify where the plug-in is to be stored in the check manpages' catalogue. If a category is missing (for example, a new manufacturer) this must be defined in the cmk/man_pages.py file, in the catalog_titles variable – or from Version 1.6 in the file cmk/utils/man_pages.py.
Currently this file cannot be extended with plugins in local/, so that only the developers of Checkmk can make changes here.
Please note the correct upper and lower case spelling of product and company names! This applies not only for the catalogue entry, but also for all other texts where these names appear. Example: NetApp is always written NetApp, and not netapp, NETAPP, Netapp, or the like. Google can help to find the correct spelling!
The following information must be included in the Manpage's description:
- Exactly which hardware or software is monitored by the check? Do any of the devices' firmware or product versions have special features? Do not relate these to a MIB, rather to the products' names. Example: It is of no help to the user to write “This check functions for all devices which support Wrdpfrmpft-17.11-MIB”. Rather write which product lines or the like are supported.
- Which aspect of the device will be monitored? What does the check do?
- Under which conditions will the check return OK, WARN or CRIT?
- Will the check require an agent plug-in? If yes, how will this be installed? That must also be possible without the Agent Bakery.
- Are there further prerequisites in order for the check to function, such as preparation of the target system or the installation of drivers? These should only be specified if they are not normally fulfilled anyway (the mounting of /proc under Linux, for example, normally is, and need not be mentioned).
Under inventory: describe the conditions under which this check's service(s) are found automatically. An example from nfsmounts:
inventory: All NFS mounts are found automatically. This is done
With checks that have an item (thus also a %s in their service names), in the Manpage under item: it must be described how this is to be generated.