Ep. 44: Working with the Datadog integration in Checkmk

[0:00:00] Welcome to the Checkmk Channel. Today, I'll be talking about Datadog integration
[0:00:14] This integration allows us to correlate data from the two systems, speed up root cause analysis, and facilitate communication between Checkmk and Datadog users.
[0:00:25] This is what I'm going to show you now. So, without further ado, let's get started. Here is a fresh Checkmk instance. Before we enable the Datadog integration, we need to complete a few prerequisites: creating a host, uploading the API key and application key to the password store, and last but not least, creating the Event Console rules.
[0:00:50] So, to create a host, we need to click on Setup, Hosts, Add host. Define the host name here. We do not want to ping this host, so we set it to No IP. Everything else can remain as it is. Click on Save & go to folder.
[0:01:10] Before we activate the changes, we can also upload our API key and application key. To do that, we need to click on Passwords, Add password. Under Unique ID, we can define an ID for the API key.
[0:01:28] You can give it any name; in this case, I'm calling it checkmk_api_key. You can also add a comment, and the key itself goes under Password. Now we need to go over to the Datadog UI to fetch the API key and upload it to the password store.
[0:01:48] Let's go to the Datadog Web UI. You need to select your organization, go to the Organization Settings, then API Keys. Click on this API key, copy it, and paste it here. You need to repeat this process in the same way for the application key.
[0:02:14] You again need to give it a unique name, and then we paste the application key in the same way.
[0:02:26] Select the Application Keys, copy it, and paste it inside the Password field. So, we can save it. And now we can activate the changes.
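As a sanity check before saving, you can verify that an API key is valid against Datadog's public `/api/v1/validate` endpoint. A minimal Python sketch of that request, assuming an EU organization (the key value below is a placeholder, not a real key):

```python
import urllib.request

# Placeholder: substitute the real Datadog API key you copied above.
API_KEY = "0123456789abcdef0123456789abcdef"

# EU organizations use api.datadoghq.eu (this matches the API host
# setting later in the form); US organizations use api.datadoghq.com.
url = "https://api.datadoghq.eu/api/v1/validate"

request = urllib.request.Request(url, headers={"DD-API-KEY": API_KEY})
# A 200 response means the key is valid; a 403 means it is not.
# response = urllib.request.urlopen(request)  # uncomment to actually call
print(request.full_url)
```

This only assembles the request; uncommenting the `urlopen` line performs the actual validation call.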
[0:02:44] So, we have already completed two prerequisites. Now we will go to the Event Console to fill out the third prerequisite. It's under the Setup menu.
[0:02:56] You already have a default pack. Under the default pack, you can create different filters or rules, which will be used to process the messages and events that you get from Datadog.
[0:03:16] I have already configured these rules beforehand in order to save time. Essentially, what they do is look for a pattern and, based on that pattern, assign a state.
[0:03:32] It works the same way for the logs: whenever a log arrives with the message "No data received.", the rule sets the state for that log accordingly. Okay, so our prerequisites are done. We can now go back to the Datadog integration.
[0:03:49] It is under the Setup menu. You need to go to Other integrations, click on Datadog, then click on Add rule. Here you need to describe what the rule is for; in this case, the Datadog integration.
[0:04:07] Under Comment, explain why you are creating the rule, and you can add a link under Documentation URL. Then we can jump directly to filling out the form for the Datadog instance.
[0:04:20] We already uploaded the API key and application key to the password store, so we just need to import them here. We have now imported the API key, and we will also import the application key.
[0:04:36] My API host will remain as it is, because my Datadog instance is running in Europe, so the API endpoint, or API host name, is api.datadoghq.eu. I don't need to define a proxy, because I can connect from my Checkmk instance to the Datadog instance without any proxy in between, so I will leave this option out.
[0:05:07] Then we can go straight to fetching the monitors. If you click on Monitors, you can set some restrictions based on tags or monitor tags. Whenever those monitors are fetched from Datadog into Checkmk, they will each be created as a separate service check.
[0:05:28] So, if you have three monitors in your organization, or however many your filter matches, each of those monitors will be created as a service check. Now let's select the events. The maximum age of fetched events defaults to 10 hours; I will set it to one hour.
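Under the hood, fetching the monitors corresponds to a call against Datadog's monitors API, optionally filtered by tags. A sketch of how such a request URL is assembled, assuming the EU API host chosen above (the tag value is invented for illustration):

```python
from urllib.parse import urlencode

API_HOST = "api.datadoghq.eu"  # EU endpoint, as chosen in the rule

# Hypothetical tag filter; with no filter, every monitor in the
# organization is fetched, and each one becomes its own service check.
params = {"monitor_tags": "team:infra"}

url = f"https://{API_HOST}/api/v1/monitor?{urlencode(params)}"
print(url)
```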
[0:05:56] The restriction on and display of tags in the Event Console can stay as it is. The other settings, like Syslog facility and Syslog priority, are quite standard for Event Console rules. I will also not add any text to the events, so that can stay as it is.
[0:06:18] Now we will select Fetch logs. Here again we need to define a maximum age for fetched logs. By default it is 10 minutes; I will change it to one hour as well.
[0:06:32] In the Log search query, you can use the Datadog log search syntax. The same sort of query you use to search the log stream, or the Logs section of Datadog, can also be defined here.
[0:06:52] I'm not going to define anything because I want to fetch everything that is available in the log stream. And when it comes to indexes, I have defined a *, that means it will use all the indexes.
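These settings map onto a Datadog log search request: a query string in Datadog's log search syntax, a list of indexes, and a time window. A sketch of such a payload as a plain dictionary, loosely following the field layout of Datadog's Logs Search API (the one-hour window mirrors the maximum age set above):

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

payload = {
    "filter": {
        "query": "*",    # wildcard query: fetch everything in the log stream
        "indexes": ["*"],  # '*' means use all indexes
        "from": (now - timedelta(hours=1)).isoformat(),  # max age: 1 hour
        "to": now.isoformat(),
    },
}
print(payload["filter"]["query"])
```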
[0:07:07] Everything else, such as Syslog facility and Service level, remains as it is. The text of forwarded events can be constructed from the attributes of the log entry; each element is a name and a key.
[0:07:20] So, here I have entered message for both the name and the key. You can add further elements to it. Now you can assign this rule to the host that you created in the previous steps and save it.
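The name/key pairs build the forwarded event text out of the log entry's attributes. A toy Python sketch of that construction (the log entry and its attribute values are invented for illustration):

```python
# Each pair is (name shown in the event text, attribute key in the log entry),
# mirroring the single message/message element configured above.
ELEMENTS = [("message", "message")]

# Invented example log entry.
log_entry = {"message": "denied: authentication failed", "host": "web-01"}

def event_text(entry: dict) -> str:
    """Join 'name: value' fragments for every configured element."""
    return ", ".join(f"{name}: {entry.get(key, '')}" for name, key in ELEMENTS)

print(event_text(log_entry))  # -> message: denied: authentication failed
```

Adding more elements to `ELEMENTS` would append further fragments to the forwarded text.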
[0:07:37] Before we activate the changes, we can already do the discovery from this page by clicking on the host name, Save & go to service configuration.
[0:07:48] You will now see the list of service checks created from the elements we selected inside the rule. If you look at Datadog Events, it says it has already forwarded 4 events to the Event Console.
[0:08:06] For the Datadog logs, it shows that it has forwarded 0 logs to the Event Console. In the case of a monitor, this is the name of my monitor inside the Datadog organization: Disk space usage on {{device.name}}/{{host.name}}.
[0:08:25] The variables you see in curly braces are defined like this inside Datadog, and we use exactly the same names inside Checkmk. It has the overall state, and it has thresholds for critical and warning. Now we can click on Accept all.
[0:08:49] Once we click on Accept all, we can activate the changes so that they are made permanent. On the next refresh, you can see all your service checks. When you click on datadog-instance now, the error message you see is shown because this Checkmk host doesn't have an IP address; it is just a temporary warning.
[0:09:16] We can click on Reschedule check, and this will now show you the checks that were created by defining this rule.
[0:09:27] As we can see on the screen, it has forwarded 0 events and 0 logs.
[0:09:33] Let's have a look at the Event Console. When we look at the list of events in the Event Console, we see a mix of logs and events.
[0:09:47] All the logs are matched by the datadog_logs rule that we defined inside the Event Console as a prerequisite. Then there were three event rules: datadog_rule_warn, datadog_rule_triggered, and datadog_rule_recovered.
[0:10:05] So, when we look at the events, we can see that an event was triggered, then it went to the warning state twice, and after that it recovered. If we look at the logs, there were logs with denied/failed error messages sent at particular timestamps; they were also fetched from the Datadog log stream and shown inside the Checkmk Event Console.
[0:10:36] So, that's it for today. That was a brief overview of the Datadog integration. Thanks for watching. Please don't forget to like and subscribe. And see you next time.

Want to know more about Checkmk? Join us for our Introduction to Checkmk Webinar

