Distributed monitoring with Checkmk consists of a central site and at least one remote site. Setting up such a distributed monitoring, and in particular connecting the remote sites, is essentially standard networking. However, some Checkmk specifics also have to be taken into account.
For demonstration purposes, this article works with a simple distributed monitoring setup consisting of a central site and several remote sites, including synchronization of the configuration. In addition, these best practices for connecting remote sites are limited to IPv4, since experience shows that organizations currently operate either a pure IPv4 intranet or a dual-stack intranet with IPv4 and IPv6, but not a pure IPv6 intranet.
The simplest example is a flat network where routing works between the hosts running the Checkmk sites. This is the most typical setup in a small organization without branch offices, where distributed monitoring is usually not necessary anyway.
Companies with branch offices typically operate a company-wide VPN, which is usually at least orchestrated by a central IT department. This department assigns an exclusive subnet to each location, which in turn allows for simple routing. Distributed monitoring is not absolutely necessary in such a network. If the branch offices have few devices (rule of thumb: a single-digit number of devices) or manage without a hypervisor, remote sites are usually not set up there.
Site2Site VPN
With a managed service provider, things look different again. Here, there are usually several different connections to customer networks. A typical connection type is Site2Site VPN, which lands at a dedicated firewall for customer connections. A common problem in this case is that the customer networks have private IPv4 ranges that overlap. Routing directly into these private networks is therefore generally not possible. Alternatively, NAT (Network Address Translation) can be used here.
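Whether customer networks overlap can be checked before a Site2Site VPN is planned. The following is a minimal sketch using Python's standard `ipaddress` module; the customer names and address ranges are made up for illustration and do not come from a real setup.

```python
import ipaddress
from itertools import combinations


def find_overlaps(customer_networks):
    """Return (customer_a, customer_b, net_a, net_b) for every overlapping pair.

    customer_networks maps a customer name to a list of IPv4 networks in
    CIDR notation.
    """
    parsed = {
        name: [ipaddress.ip_network(cidr) for cidr in cidrs]
        for name, cidrs in customer_networks.items()
    }
    overlaps = []
    # Compare every pair of customers exactly once.
    for (name_a, nets_a), (name_b, nets_b) in combinations(parsed.items(), 2):
        for net_a in nets_a:
            for net_b in nets_b:
                if net_a.overlaps(net_b):
                    overlaps.append((name_a, name_b, str(net_a), str(net_b)))
    return overlaps


# Hypothetical customers and ranges, for illustration only.
customers = {
    "customer_a": ["192.168.0.0/24", "10.10.0.0/16"],
    "customer_b": ["192.168.0.0/16"],
    "customer_c": ["172.16.5.0/24"],
}

for a, b, net_a, net_b in find_overlaps(customers):
    print(f"{a} ({net_a}) overlaps with {b} ({net_b})")
```

Any pair reported here cannot be routed directly side by side and would need NAT or one of the other connection types described in this article.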
Another typical connection type is DNAT (Destination Network Address Translation) on the customer firewall, i.e. without a VPN connection to the customer. In this case, dedicated high ports on the customer firewall are forwarded to the remote site host. The source IPs are limited to the smallest possible network of the MSP; only incoming connections are needed. In practice, this can look like the following:
In this example, the high ports on the firewall are set in the 655x range, which is typical for Checkmk. The last digit corresponds to the connection type (see above list).
The big advantage of this variant is that overlaps of customer networks do not matter, which allows a service provider to roll out standardized intranet installations.
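After the port forwardings have been set up on the customer firewall, it can be useful to verify from the MSP network that the forwarded high ports actually accept connections. The following sketch does this with plain TCP connection attempts; the host name and port numbers in the commented call are hypothetical placeholders and have to be replaced with the values agreed with the customer.

```python
import socket


def check_tcp_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port and return {port: reachable}."""
    results = {}
    for port in ports:
        try:
            # A completed TCP handshake is treated as "reachable".
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name resolution errors.
            results[port] = False
    return results


# Example call (hypothetical firewall name and forwarded ports):
# print(check_tcp_ports("fw.customer.example", [65551, 65552]))
```

This only confirms that a connection can be established. It does not show whether the forwarding ends at the intended service on the remote site host.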
The remote site host should be monitored in sufficient detail as a status host, even though configuring a status host in the site settings is not required when the Livestatus proxy is used. To do this, tunnel the agent output via SSH, as in the example, which makes another DNAT connection unnecessary. The host check command should accordingly be changed to the status of the Checkmk service.
The optional Notification Connection is used to process notifications centrally, for example in a central ticket system of the MSP. Checkmk currently does not provide an encryption solution here, and this will not change until the next version (Checkmk 2.1). In the meantime, a tool such as Stunnel can be used to encrypt this connection.
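Fetching the agent output through SSH could look roughly like the following sketch, which could serve as the basis for an individual program call instead of the normal agent access. It assumes that key-based login is already set up, that the agent can be started on the remote side as `check_mk_agent`, and that the user name, host name and port are placeholders chosen for illustration.

```python
import subprocess


def build_ssh_agent_command(host, user="monitoring", port=22, identity_file=None):
    """Build the ssh command line that runs the agent on the remote host."""
    cmd = [
        "ssh",
        "-T",                       # no pseudo-terminal, only plain output
        "-o", "BatchMode=yes",      # never prompt for a password
        "-o", "ConnectTimeout=10",
        "-p", str(port),
    ]
    if identity_file:
        cmd += ["-i", identity_file]
    # Assumption: the agent is available as "check_mk_agent" for this user.
    cmd += [f"{user}@{host}", "check_mk_agent"]
    return cmd


def fetch_agent_output(host, **kwargs):
    """Run the agent via SSH and return its output as text."""
    result = subprocess.run(
        build_ssh_agent_command(host, **kwargs),
        capture_output=True, text=True, timeout=60, check=True,
    )
    # Agent output contains a <<<check_mk>>> section; use it as a sanity check.
    if "<<<check_mk>>>" not in result.stdout:
        raise ValueError("output does not look like Checkmk agent data")
    return result.stdout


# Example call (hypothetical host and forwarded SSH port):
# print(fetch_agent_output("fw.customer.example", port=65522))
```

Restricting the SSH key on the remote host to exactly this one command keeps the tunnel from being usable for anything other than reading the agent output.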
DNAT for robust MSP monitoring
For robust and heterogeneous MSP monitoring, it is advisable to use DNAT as standard. If individual customers insist on a VPN despite incompatible customer networks, solutions with an additional intermediate host are conceivable. This intermediate host, which is located in the provider's data center, communicates via VPN with the remote site host at the customer's site and offers this connection to the central site via port forwarding.
The approaches discussed so far have always assumed that access from an address range of the MSP, initiated by the central Checkmk site, is possible and permitted. However, for security reasons, this may not be desirable in some cases. The CMCDump method is suitable for such cases, although it comes with restrictions of its own.
With this method, a cron job creates an archive of the current overall status at a specified time interval, for example every minute. This created archive file is then transmitted to the central site and imported there. A prerequisite for CMCDump, however, is that outgoing traffic is permitted and that it is also suitable for transporting the archive file, such as a .tgz. The transfer can be realized via scp to a file host or via e-mail. However, the individual transport solution must be scripted.
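The scripted transport described above could be sketched as follows. This is only an illustration under several assumptions: the script runs as the site user of the remote site, `cmcdump` is available in that user's path and writes the status dump to standard output, and the site name and the scp target are placeholders.

```python
import io
import subprocess
import tarfile
import time
from pathlib import Path


def dump_state():
    """Run cmcdump and return the status dump as bytes (assumed invocation)."""
    return subprocess.run(["cmcdump"], capture_output=True, check=True).stdout


def pack_state(state, out_dir, site_name, timestamp=None):
    """Write the dump into a .tgz archive and return the archive path."""
    timestamp = int(time.time()) if timestamp is None else timestamp
    archive = Path(out_dir) / f"{site_name}-{timestamp}.tgz"
    info = tarfile.TarInfo(name=f"{site_name}.state")
    info.size = len(state)
    info.mtime = timestamp
    with tarfile.open(archive, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(state))
    return archive


def ship(archive, target):
    """Copy the archive to the file host via scp (batch mode, no prompts)."""
    subprocess.run(["scp", "-q", "-B", str(archive), target], check=True)


# Example for a cron job (hypothetical site name and target):
# archive = pack_state(dump_state(), "/tmp", "remote1")
# ship(archive, "transfer@filehost.example:/incoming/")
```

The import on the central site is not covered here and has to be scripted separately as well.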
In summary, solutions of any required complexity are possible, depending on the framework conditions. Managed service providers are therefore advised to define possible implementation variants and prerequisites in advance and to document and communicate these adequately in order to avoid creating extra work for the customer.