Monitoring with SNMP: Stories from hell

By Timo Scheibe on Jul 9, 2020

Reading time 5 minutes

Those who monitor their network environment with SNMP benefit from the fact that many manufacturers support the Simple Network Management Protocol on their devices. As the protocol has become the de facto standard in network monitoring since its introduction over 30 years ago, almost every manufacturer has integrated an SNMP agent on their devices. Unfortunately, however, some do not handle the SNMP implementation on their components as required by the standard.

Since many Checkmk users monitor their network environments with SNMP, we know first-hand that bad SNMP implementations are unfortunately not uncommon among manufacturers. In this article, we present some typical examples of faulty SNMP implementations from our pool of experience.

SNMP delivers incorrect data

If a manufacturer's developers do not implement SNMP properly, the agent may deliver confusing or even incorrect data, perform poorly, fail to comply with the SNMP standard, or simply not work. This poses various problems for monitoring the network environment, as it becomes more difficult to obtain the required data.

Often it is possible to get the necessary monitoring information with Checkmk via different standard MIBs, such as Host-Resources and UCD. This becomes problematic, however, if the two MIBs supply different values for the main memory, for example. One MIB may provide a wrong value while the other transmits the correct information. Theoretically, both MIBs could also provide wrong values. You can read how OIDs and MIBs work in our first article about network monitoring with SNMP.

If the SNMP agent delivers contradictory or even incorrect data, the administrator must attempt by trial-and-error to find out which value is correct or incorrect and remove the incorrect information from monitoring. This sounds very complex. The good news is that Checkmk often already knows which values are incorrect, so it automatically ignores the incorrect information. This has to do with how Checkmk handles SNMP monitoring:

When monitoring via SNMP, Checkmk makes a complete pull of all SNMP data (SNMP walk) during the service discovery, and then looks for the interesting information. Since this would take several hours for some devices, Checkmk first calls only the first two OIDs, the sysDescr and sysObjectID. The sysDescr is the description of the device, which the manufacturer has defined in the firmware. This text is important for Checkmk to automatically detect the services. Depending on the device detected, further queries are then made to determine which of the more than 1,000 SNMP check plug-ins supplied with Checkmk are suitable for the device. The result from a scan of a switch could look like this, for example:

SNMP scan found                   hr_mem if64 ucd_mem mgmt_snmp_uptime snmp_info snmp_uptime
SNMP filtered check plugin names  hr_mem if64 snmp_info snmp_uptime

As we can see, Checkmk has found some of the relevant data at several positions in the OID tree – here, for example, mgmt_snmp_uptime and snmp_uptime – which enables it to filter the final list of matching checks. If one of the OIDs found is already known to return incorrect values, Checkmk sorts it out directly. In this example, Checkmk has chosen hr_mem because this device is known to provide incorrect data in ucd_mem.
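
The scan and filter steps can be pictured with a small sketch. It is not Checkmk's actual code: the plug-in names are taken from the example output above, while the scan conditions, the KNOWN_BAD table, the device string, and the object ID are assumptions invented purely for illustration.

```python
# Hypothetical sketch of a scan phase; this is not Checkmk's real code.
from typing import Callable, Dict, List, Set

# A scan function sees only sysDescr and sysObjectID and decides whether
# its plug-in is a candidate. The conditions below are invented.
SCAN_FUNCTIONS: Dict[str, Callable[[str, str], bool]] = {
    "hr_mem": lambda descr, object_id: True,
    "if64": lambda descr, object_id: True,
    "ucd_mem": lambda descr, object_id: "linux" in descr.lower(),
}

# Assumed table: marker in sysDescr -> plug-ins known to be wrong there.
KNOWN_BAD: Dict[str, Set[str]] = {"ExampleSwitch": {"ucd_mem"}}


def scan(sys_descr: str, sys_object_id: str) -> List[str]:
    """Return every plug-in whose scan function accepts the device."""
    return [
        name
        for name, accepts in SCAN_FUNCTIONS.items()
        if accepts(sys_descr, sys_object_id)
    ]


def filter_plugins(found: List[str], sys_descr: str) -> List[str]:
    """Drop plug-ins that are known to deliver bad data on this device."""
    excluded: Set[str] = set()
    for marker, plugins in KNOWN_BAD.items():
        if marker in sys_descr:
            excluded |= plugins
    return [name for name in found if name not in excluded]


descr = "ExampleSwitch OS 1.0 (Linux)"  # made-up sysDescr
found = scan(descr, ".1.3.6.1.4.1.99999.1")  # placeholder sysObjectID
print("scan found:", " ".join(found))
print("filtered:  ", " ".join(filter_plugins(found, descr)))
```

The point of the design is that only two cheap OIDs are needed to narrow down the candidates, and the exclusion list captures vendor-specific knowledge in one place.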

In the next step, the actual service recognition will then run. The check plug-ins identified in the scan phase now use focused SNMP requests to retrieve the data they need. Based on the values received, they then determine the services to be monitored. This is the so-called 'Discovery Phase'.

Note: For historical reasons, the underlying Checkmk function is still called inventory_function in the code. With the introduction of the hardware and software inventory, we changed the display in the GUI to Check_MK Discovery to avoid any misunderstandings.

The third and last phase is the so-called Check Phase. This is executed at each check cycle and retrieves the OIDs that Checkmk found in the discovery phase.
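
As a rough illustration of how the discovery and check phases divide the work, here is a minimal sketch for a single imaginary plug-in. The function names, the data shape, and the thresholds are assumptions and do not reflect Checkmk's plug-in API; only the numeric states follow the common monitoring convention (0 = OK, 1 = WARN, 2 = CRIT, 3 = UNKNOWN).

```python
# Hypothetical sketch of the discovery and check phases for one plug-in.
from typing import Dict, List, Tuple

# Assumed shape of the fetched SNMP data: sensor name -> raw value.
SnmpData = Dict[str, int]


def discover_sensors(data: SnmpData) -> List[str]:
    """Discovery phase: decide once which services exist on the device."""
    return sorted(data)


def check_sensor(item: str, data: SnmpData, warn: int, crit: int) -> Tuple[int, str]:
    """Check phase: runs every cycle for one already-discovered service."""
    if item not in data:
        return 3, f"{item}: no data received"
    value = data[item]
    if value >= crit:
        return 2, f"{item}: {value} (critical at {crit})"
    if value >= warn:
        return 1, f"{item}: {value} (warning at {warn})"
    return 0, f"{item}: {value}"


first_walk = {"fan1": 40, "fan2": 85}  # invented sample data
services = discover_sensors(first_walk)
for service in services:
    print(check_sensor(service, first_walk, warn=80, crit=90))
```

Because the list of services is fixed during discovery, a sensor that later disappears from the data shows up as UNKNOWN instead of silently vanishing from monitoring.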

1⁄10 apples or pears?

As we have seen, it can happen that devices deliver one or more incorrect values. However, this is far from the only source of error. Over and over again, undefined or incorrectly declared units provide plenty of food for discussion during our developers' coffee breaks. So it happens from time to time that the device manufacturer specifies degrees Celsius as the unit, but the implementing developer outputs Fahrenheit in the OID.

There was also once a very creative solution: the manufacturer's documentation spoke of degrees Celsius, but gave no clear indication that the output value in fact corresponded to 1⁄10 °C.

Inadequate documentation or creative implementation on the part of the manufacturer can also lead to further confusion. Using the output of '0°C' to signal that a sensor is defective or not available can obviously lead to misunderstandings in the field of ambient temperature measurement – in the worst case even to serious mistakes.

These examples show that it can be deceptive for the user to rely blindly on the data queried via SNMP. They make it clear that SNMP monitoring requires know-how, and that no generally valid formula can be derived for every SNMP device. Instead, each device must be considered individually. Only with the necessary knowledge of the context can the collected data be interpreted correctly.
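
The kind of device-specific handling this implies can be sketched in a few lines. The function names, the default divisor of 10, and the sentinel value of 0 are assumptions for illustration; the right values differ per device and have to come from the vendor documentation or from testing.

```python
# Minimal sketch: per-device normalisation of raw temperature readings.
from typing import Optional


def normalize_temperature(
    raw: int, divisor: int = 10, sentinel: Optional[int] = 0
) -> Optional[float]:
    """Return degrees Celsius, or None if the raw value is the sentinel.

    divisor=10 models a device that reports tenths of a degree.
    sentinel=0 models a device that sends 0 for a missing sensor.
    """
    if sentinel is not None and raw == sentinel:
        return None
    return raw / divisor


def fahrenheit_to_celsius(value_f: float) -> float:
    """Convert a reading from a device that outputs Fahrenheit."""
    return (value_f - 32.0) * 5.0 / 9.0


print(normalize_temperature(235))   # raw tenths of a degree
print(normalize_temperature(0))     # sentinel for a missing sensor
print(fahrenheit_to_celsius(212.0))
```

Returning None for the sentinel keeps a missing sensor from being reported as a genuine 0 °C reading. The trade-off is that a device using this convention can never report exactly 0 °C, which is one more reason the convention is unfortunate.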

“What day is today?” – “February!”

It is not surprising that SNMP implementations are not immune to typical programming errors. Common errors familiar from software development can therefore also be found in them. The date format illustrates this well. Instead of using the ISO format, manufacturers often mix the various locally common variants: YYYYMMDD, MMDDYYYY or DDMMYYYY. For example, the specification 05022020 could mean February 5th, 2020 or May 2nd, 2020. If the year is shortened to two digits, the date 050220 can even be interpreted as February 20th, 2005.

This means that the user or check developer must also deal with the details in such a case. For example, we have already had the case where a manufacturer has provided three digits for the day in its SNMP stack in the date format: 001 thus corresponds to the first day of each month. Sure, you can do that. But should you...?
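
The ambiguity described above is easy to reproduce with Python's standard library. The following sketch tries one raw string against several layouts and returns every reading that parses; the FORMATS table and the function name are illustrative assumptions, not part of any monitoring tool.

```python
# Sketch: one raw date string, several plausible interpretations.
from datetime import date, datetime
from typing import Dict

# Label -> strptime format. Extend as needed for a given device.
FORMATS: Dict[str, str] = {
    "MMDDYYYY": "%m%d%Y",
    "DDMMYYYY": "%d%m%Y",
    "YYYYMMDD": "%Y%m%d",
    "MMDDYY": "%m%d%y",
    "DDMMYY": "%d%m%y",
    "YYMMDD": "%y%m%d",
}


def possible_dates(raw: str) -> Dict[str, date]:
    """Return every layout under which the raw string is a valid date."""
    results: Dict[str, date] = {}
    for label, fmt in FORMATS.items():
        try:
            results[label] = datetime.strptime(raw, fmt).date()
        except ValueError:
            continue  # this layout does not fit the string
    return results


for raw in ("05022020", "050220"):
    for label, parsed in possible_dates(raw).items():
        print(raw, label, parsed.isoformat())
```

When more than one layout parses, the string alone cannot settle the question. A check developer then needs a second data point, such as a value recorded on a day greater than 12, to pin down the format a device really uses.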

See you in a minute...

When it comes to network monitoring with SNMP, the constant load caused by SNMP polling – i.e. the polling of data in a certain time interval – on the devices is often mentioned. In earlier times, when monitoring systems were not so powerful and could only query data in intervals of several minutes, this did not have such a great effect on the network components. In the meantime, however, monitoring solutions have become much more powerful.

By default, Checkmk polls the devices at one-minute intervals. As already described in this article, it automatically detects the required OIDs for the services to be monitored. With this efficient procedure, Checkmk already keeps the load on the devices as low as possible. Fortunately, most modern devices have no problems with a one-minute polling interval. However, practice shows that older devices or switch stacks above a certain size can reach their performance limits.

The OID Allergy

From time to time it can also occur that devices have an ‘allergic reaction’ to the OID sequence queried in a check. For example, if the OIDs are queried with an snmpbulkget with a bulk size of 10, it can happen that the SNMP stack crashes at a certain point and returns no values, or only incomplete ones. In our next blog post we will take a closer look at this problem and explain how to work around it. For check developers, for example, a solution could be to change the order of the queried OIDs to avoid the SNMP stack crashing.
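
Besides reordering, another mitigation that is sometimes possible is to fall back to smaller requests when a bulk request fails. The sketch below shows that idea only; it is not the workaround announced for the follow-up post, and the fetch callable is a hypothetical stand-in for whatever SNMP library is in use. It is assumed to take a list of OIDs, return a dict of the values it obtained, and raise TimeoutError when the agent stops answering.

```python
# Sketch: retry OIDs one by one when a bulk request fails or comes up short.
from typing import Callable, Dict, List, Sequence

# Hypothetical transport: takes OIDs, returns {oid: value}.
Fetch = Callable[[Sequence[str]], Dict[str, str]]


def fetch_with_fallback(
    oids: Sequence[str], fetch: Fetch, bulk_size: int = 10
) -> Dict[str, str]:
    """Query OIDs in chunks; re-query missing ones individually."""
    results: Dict[str, str] = {}
    for start in range(0, len(oids), bulk_size):
        chunk: List[str] = list(oids[start:start + bulk_size])
        try:
            values = dict(fetch(chunk))
        except TimeoutError:
            values = {}
        # Anything the bulk request did not deliver is asked for again,
        # one OID per request, so a single bad OID cannot hide the rest.
        for oid in chunk:
            if oid in values:
                continue
            try:
                values.update(fetch([oid]))
            except TimeoutError:
                pass  # this OID stays missing
        results.update(values)
    return results
```

The fallback costs extra round trips and therefore adds load, so it is best reserved for devices that are known to misbehave.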

It also sometimes happens that a MIB is not backwards-compatible – which by default it should be. This is problematic if the firmware changes the OID implementation. Depending on which firmware is installed on the network device, you will need to know under which OID the value you are looking for can be found.
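
A check that has to cope with such firmware-dependent OIDs typically ends up with a lookup like the one sketched below. The version thresholds and OIDs are placeholders invented for illustration (the enterprise number 99999 stands for no real vendor), and the sketch assumes purely numeric, dot-separated firmware versions.

```python
# Sketch: choose the OID for a value depending on the firmware version.
from typing import List, Optional, Tuple

Version = Tuple[int, ...]

# (minimum firmware version, OID), sorted by ascending version.
# Both columns are made-up placeholders.
OID_BY_FIRMWARE: List[Tuple[Version, str]] = [
    ((1, 0), ".1.3.6.1.4.1.99999.1.1.0"),
    ((2, 0), ".1.3.6.1.4.1.99999.2.1.0"),
]


def parse_version(text: str) -> Version:
    """Turn '2.10.3' into (2, 10, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def oid_for_firmware(firmware: str) -> str:
    """Return the OID of the newest table entry the firmware satisfies."""
    version = parse_version(firmware)
    chosen: Optional[str] = None
    for minimum, oid in OID_BY_FIRMWARE:
        if version >= minimum:
            chosen = oid
    if chosen is None:
        raise ValueError(f"no known OID for firmware {firmware}")
    return chosen


print(oid_for_firmware("1.4.2"))
print(oid_for_firmware("10.1"))
```

Comparing versions as integer tuples rather than strings matters here: as text, "10.1" would sort before "2.0".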

UDP-Odyssey

But it is not only SNMP polling that has disadvantages. SNMP traps also often do not work as promised. A faulty implementation can prevent an automatic trap from being triggered despite an event having occurred.

This becomes even more problematic if a manufacturer advertises its device as having SNMP support but has actually only integrated event-based traps, and thus does not allow SNMP polling of the monitoring data at all. In combination with the other disadvantages of SNMP traps, this can cause problems with network monitoring.

Finally, monitoring with traps is much more error-prone than SNMP monitoring using active requests. One of the reasons for this is the unreliability of SNMP traps. These are sent as UDP packets that can get lost. Due to the missing receipt verifications, the packet loss remains undetected – i.e. the recipient does not even know that a notification was sent. SNMP traps also only send error messages, but no recovery notifications, so the current status in monitoring may also remain unclear.

In addition, if a major upstream service fails in a large network environment, thousands of switches may send traps simultaneously. Under the burden of so many messages, the trap receiver may break down and monitoring may not be available at the precise moment when the user needs it most.

The Event Console

With the Event Console, Checkmk provides an integrated system for avoiding such an overload of the trap receiver. For example, the Event Console processes the messages triggered by an event using the Event Daemon (mkeventd) rather than the monitoring core. The purpose of the Event Console is to filter only the relevant messages from a large stream of messages.

For this purpose, the Event Console can, among other things, directly receive messages via syslog or SNMP traps and classify them according to user-defined rules. The engine takes the approach that the first applicable rule generates an event from the message. Furthermore, it is able to correlate, summarize, count, annotate, and rewrite the messages, and to take their temporal relationships into account.

Note: Checkmk processes rules as rule chains, i.e. it merges different parameters from several rules in the same rule set. In the Event Console, only the first applicable rule decides whether a message becomes an event – or whether it is discarded.
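
The first-match principle can be illustrated with a short sketch. It only mirrors the idea described here and is not how mkeventd is implemented; the Rule class, its fields, and the sample rules are assumptions made for the example.

```python
# Sketch of first-match rule evaluation; not the Event Console's code.
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    name: str
    pattern: str        # regular expression searched in the message text
    drop: bool = False  # True: a matching message is discarded


def classify(message: str, rules: List[Rule]) -> Optional[str]:
    """Return the name of the rule that creates the event, else None.

    Only the first matching rule counts; later rules are never consulted.
    """
    for rule in rules:
        if re.search(rule.pattern, message):
            return None if rule.drop else rule.name
    return None  # no rule matched: the message does not become an event


rules = [
    Rule("ignore_debug", r"debug", drop=True),
    Rule("link_down", r"link down"),
    Rule("catch_all", r"."),
]
for text in ("port 3 link down", "debug: link down", "fan failure"):
    print(text, "->", classify(text, rules))
```

With first-match evaluation the order of the rules carries meaning: a drop rule only suppresses noise if it is placed before the more general rules that would otherwise catch the same message.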

In addition to a possible overload of the receiver, event-based monitoring with SNMP traps can also involve considerable configuration effort. If the IP address of the trap receiver changes, the administrator must reconfigure all devices. It is even more annoying if a firmware update changes the trap OIDs. Then the administrator has to rewrite all rules – if they notice the change at all. This is because the lack of testing options for traps is another disadvantage of event-based monitoring. Unfortunately, only a few devices are capable of sending generic traps, let alone tests of real error messages. This makes it difficult to predict whether an important trap will work correctly at all, and whether a message will actually be triggered when an event occurs.

An oldie but a goldie

All in all, SNMP is a protocol from the 1980s that has established itself as a de facto standard – at least for network monitoring. Even if it does get stuck at one point or another, SNMP will continue to be a central pillar for monitoring network components. Despite some shortcomings – especially in times of increasing numbers of devices – it allows your own IT infrastructure to be monitored quickly.

Nevertheless, the examples shown have made it clear that it is necessary to deal individually with each device that you integrate into your monitoring environment. Generic SNMP checks can otherwise quickly lead to incorrect data and values in your own monitoring instance. This can result in false findings and conclusions concerning the state of the network environment being monitored.

This situation can only be avoided with specially-written checks and domain expertise of the developers. Depending on the selected monitoring solution, the administrator may have to do this manually. Checkmk, on the other hand, offers the possibility of automatically detecting the services relevant for monitoring with the help of the more than 1,000 supplied SNMP check plug-ins. In this way, monitoring can be set up within a short time with SNMP.

In addition, Checkmk also provides tools for administrators and users for fixing some of the problems caused by poor SNMP implementations. In our next blog post you will learn all about the ‘God Mode’, and how it can help you with SNMP troubleshooting.