Checkmk at EDEKA
Installations with more than 1,200 independent Nagios servers are certainly not an everyday occurrence. Generally, managing such an environment is more than a full-time job. But this is not the only impressive number in this Nagios project: as of today, each of those 1,200 servers monitors an average of 25 hosts and 250 services. This adds up to 300,000 services, each of which is checked once every minute! All this information is then brought together in one instance - visualizing and evaluating it
EDEKA Minden-Hannover in brief
EDEKA Minden-Hannover, with a turnover of 6.9 billion euros in 2011, 32,000 employees and around 1,600 stores, is the largest of the seven EDEKA regional businesses in Germany. Its market region stretches from the Dutch border to the Polish border. It covers part of Eastern Westphalia, most of Lower Saxony, and the federal states of Bremen, Saxony-Anhalt, Berlin and Brandenburg.
The starting situation
EDEKA Minden-Hannover required a cost-efficient, flexible and modern solution for monitoring its stores and a central system. The open source software Nagios quickly came under discussion. As in many other businesses, Nagios was initially installed by hand and extended bit by bit with add-ons such as NSCA and NagVis. It soon became clear that such a classically installed Nagios had reached its limits. One problem was the heavy network load caused by the many individual checks. Furthermore, graphical configuration tools such as NConf also turned out to be at the limits of what they could manage. Over time, performance bottlenecks also became noticeable on the central Nagios server. The first contact with tribe29 GmbH came through the Nagios add-on Checkmk. In a joint three-day workshop, ideas and concepts for modern, agile monitoring were developed.
The main requirements
- The monitored data should converge at a central location in order to have an overview of the complete environment.
- Monitoring in the stores should also function when a store is separated from the network.
- There should be the possibility for every store to view local information. The status should also be displayed in a form understandable for employees without specialised technical knowledge.
- The solution should be robust against limited bandwidth and unstable connections between the central site and the stores.
- For the central status monitor a GPS-coordinate-based overview in the form of a map should be generated.
- New systems in stores should be automatically included in the monitoring on a regular basis.
- The checks in the existing system based on Nagios + NSCA should be carried over.
- The quality of the monitoring of critical systems should be improved.
- The data of each store should be aggregated into an overall status. For example, a store should report a critical state if fewer than a defined percentage of its checkouts are operational.
- The installation should be done on the existing system (SuSE Linux Enterprise Server 10).
With this project the challenge was not only the large scale, but also the special requirements that come from the retail environment.
The implemented solution uses the current version 1.1.13 of Checkmk. Every store runs a Nagios installation based on the Open Monitoring Distribution (OMD) and Checkmk. Using OMD ensures a simple and standardised installation of the monitoring server, so the installation and configuration can be distributed with existing tools. The Nagios instance in a store is responsible for all local systems and for monitoring the central services from the store's point of view. In addition, a dashboard showing the local status can be created from the collected data. Checkmk retrieves data using its own agent, which does not need to be configured, so rolling it out to the large number of systems to be monitored is possible without problems. Checkmk's inventory function automatically determines what can be monitored on a system. The thresholds are configured via flexible rules on the central server. The store's local network is scanned regularly and automatically for new components, using the standard tool nmap. When a new system is found, Checkmk's automatic inventory detects the services to be monitored on it. This process runs fully automatically without manual intervention, which minimises the effort of managing the monitoring in the stores.
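The detection of new components described above can be illustrated with a minimal Python sketch. It is not the script used in the project: it only assumes that the scan is an nmap ping scan written in nmap's grepable format (for example `nmap -sn -oG - <network>`) and that the already monitored addresses are available as a set. The function names, addresses and host names are invented for illustration.

```python
import re

# Matches lines such as "Host: 10.0.0.5 (pos-5)<TAB>Status: Up" in nmap's
# grepable output, as produced by e.g. `nmap -sn -oG - <network>`.
HOST_UP = re.compile(r"^Host: (\S+) \(([^)]*)\)\s+Status: Up")


def parse_up_hosts(grepable_output):
    """Return {ip: hostname} for every host the scan reported as up."""
    hosts = {}
    for line in grepable_output.splitlines():
        match = HOST_UP.match(line)
        if match:
            hosts[match.group(1)] = match.group(2)
    return hosts


def find_new_hosts(grepable_output, known_ips):
    """Return the scanned hosts that are not yet part of the monitoring."""
    found = parse_up_hosts(grepable_output)
    return {ip: name for ip, name in found.items() if ip not in known_ips}


# Invented sample output; comment lines starting with "#" are ignored.
SAMPLE = (
    "# example scan output\n"
    "Host: 10.0.0.1 (router)\tStatus: Up\n"
    "Host: 10.0.0.5 (pos-5)\tStatus: Up\n"
)

new_hosts = find_new_hosts(SAMPLE, known_ips={"10.0.0.1"})
```

Each address returned by `find_new_hosts` would then be handed to the inventory step, which decides what to monitor on it.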
The statuses of all of a store's systems are combined into an overall status. This aggregation is achieved with the aid of Checkmk Business Intelligence (BI). In the aggregation, a generically formulated rule set is applied to whatever systems exist in a store. This rule set consists of 26 different rules. In this way the large effort of configuring each store explicitly is avoided.
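To make the idea of such an aggregation rule concrete, here is a small Python sketch of the checkout requirement: a store is critical when fewer than a defined percentage of its checkouts are operational. It does not reproduce Checkmk BI's rule syntax or any of the 26 rules. The function name and the treatment of a store without known checkouts are assumptions; the numeric states follow the usual Nagios convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```python
# Nagios state codes.
OK, WARN, CRIT, UNKNOWN = 0, 1, 2, 3


def store_state(checkout_states, min_percent_ok):
    """Aggregate checkout states into one state for the whole store.

    The store is CRIT when the share of checkouts in state OK is below
    min_percent_ok. With no checkouts at all nothing can be confirmed,
    so UNKNOWN is returned (an assumption made for this sketch).
    """
    if not checkout_states:
        return UNKNOWN
    ok_count = sum(1 for state in checkout_states if state == OK)
    percent_ok = 100.0 * ok_count / len(checkout_states)
    return OK if percent_ok >= min_percent_ok else CRIT


# Three of four checkouts work: exactly 75 percent, so the store is OK.
healthy = store_state([OK, OK, CRIT, OK], min_percent_ok=75)
# Only two of four checkouts work: 50 percent, so the store is CRIT.
degraded = store_state([OK, CRIT, CRIT, OK], min_percent_ok=75)
```

Because the rule works on whatever list of checkouts a store happens to have, the same rule fits stores of any size, which is what keeps per-store configuration unnecessary.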
The central instance requests the overall status of every store. To do this it uses Checkmk's JSON-based web service. The status of each store's aggregation is requested every minute, which ensures that the information at the central location reflects the current status in the store.
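The polling pattern can be sketched in Python as follows. This is an illustration only: the URLs and the response format (`{"state": <number>}`) are hypothetical and do not describe Checkmk's actual web service. The `fetch` function is passed in so that the logic can be tried out without a network. A store that cannot be reached or returns unusable data is marked separately instead of aborting the whole run, in line with the requirement to tolerate unstable connections.

```python
import json

# Marker for stores whose status could not be retrieved.
UNREACHABLE = "unreachable"


def poll_store(fetch, url):
    """Return one store's aggregated state, or UNREACHABLE on any failure.

    `fetch` takes a URL and returns the response body as text. The
    response format {"state": <int>} is hypothetical.
    """
    try:
        payload = json.loads(fetch(url))
        return int(payload["state"])
    except (OSError, ValueError, KeyError, TypeError):
        # Network errors, invalid JSON or an unexpected structure.
        return UNREACHABLE


def poll_all(fetch, store_urls):
    """Poll every store once; intended to be run once per minute."""
    return {store: poll_store(fetch, url) for store, url in store_urls.items()}


def fake_fetch(url):
    """Stand-in for a real HTTP request, used only in this example."""
    if "store-b" in url:
        raise OSError("connection timed out")
    return '{"state": 2}'


result = poll_all(fake_fetch, {
    "store-a": "http://store-a.example/status",
    "store-b": "http://store-b.example/status",
})
```

In a real setup `fetch` could wrap `urllib.request.urlopen` with a short timeout, so that a single slow store does not delay the results for all the others.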
To display the environment clearly, a dashboard is created in Checkmk Multisite and shown continuously as an overview page on two 55-inch TFT monitors. The dashboard contains individual views (dashlets) that, among other things, show stores with connection problems and list host and service problems separately.
As part of the project, NagVis is extended with Geomap functionality. Using the GPS coordinates of all locations and freely available map material from the OpenStreetMap project, NagVis creates a map on which all locations are shown.
The central Nagios system was designed and set up in the course of a three-day workshop. Following an introductory phase and an exchange of information, the basic installation of the central Nagios instance was completed in one day thanks to OMD. During this step a number of new Checkmk checks were developed (for example for monitoring Bintec routers) and have since been incorporated into the official version of Checkmk. A few weeks of design work and development of the connections to the stores followed. Then came a further four-day on-site appointment: the big rollout to all stores was prepared on the first two days and started towards the end of the second day. Within six hours a total of 1,200 Nagios systems were installed. That's more than three new installations per minute! After the installation in the stores, all systems were connected to the central instance on the third day. At the same time the Geomap with the information from all stores was put into service.
The speed with which the objectives were achieved, from planning to implementation, surprised us. The collaboration with Mathias Kettner's company was professional and goal-oriented from the beginning.
With little effort and in a spirit of partnership, an appropriate solution for EDEKA was developed and successfully implemented. The complete project was built entirely on software free of licence costs. At the same time, knowledge was passed on and the customer's know-how in the area of monitoring was strengthened.
Contact: Lars Michelsen
After Mathias Kettner, Lars Michelsen was the second employee at tribe29 GmbH. He has been involved in the development of Checkmk from the beginning. Today he is Head of Engineering.