Patrick Debois

Monitoring Wonderland Survey - Nagios the Mighty Beast

Controlling the tool everybody hates, but still uses

This blog post mainly contains my findings on getting data in and out of Nagios. That data can be status information, performance information and notifications. At the end there are some pointers on Ruby integration with Pingdom and Jira.

The idea is similar to my previous blog post, Monitoring Wonderland Survey - Metrics - API - Gateways: I want to open up this data for others to consume, preferably on a bus-like system and using events instead of polling.

Nagios - IN

Writing Checks in Ruby

If you want to get data into Nagios, you have to write a check. These are some options for doing this in Ruby:
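Whichever library you pick, a check comes down to the plugin convention: print one line of output (optionally followed by "|" and performance data) and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a minimal sketch in plain Ruby; the helper names `evaluate` and `check_result`, the "load" label and the thresholds are made up for illustration and are not taken from any of the projects mentioned here.

```ruby
# Minimal Nagios-style check in plain Ruby (no gems).
# Exit codes follow the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
EXIT_CODES = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Hypothetical helper: classifies a value against warning/critical thresholds.
def evaluate(value, warn:, crit:)
  return :unknown unless value.is_a?(Numeric)
  return :critical if value >= crit
  return :warning if value >= warn
  :ok
end

# Hypothetical helper: returns [exit_code, output_line] for a check.
# Performance data is appended after "|" as label=value;warn;crit.
def check_result(label, value, warn:, crit:)
  state = evaluate(value, warn: warn, crit: crit)
  text = "#{label.upcase} #{state.to_s.upcase} - #{label} is #{value.inspect}"
  perfdata = value.is_a?(Numeric) ? " | #{label}=#{value};#{warn};#{crit}" : ""
  [EXIT_CODES.fetch(state), text + perfdata]
end

code, line = check_result("load", 0.42, warn: 4, crit: 8)
puts line
# A real plugin would finish with: exit code
```

Returning the exit code instead of calling `exit` inside the helper keeps the logic testable; only the last line of an actual plugin script needs to exit.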

Projects that link testing and monitoring:

Transporting check results

Nagios has many ways to collect the results of these checks:

You can test NRPE with the standalone NRPE runner

And maybe schedule the Nagios NRPE checks with Rundeck

If you don’t like spawning a separate Ruby process for each check, you can leverage Metis: https://github.com/krobertson/metis
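As an illustration of one transport option, passive check results can be handed to Nagios through its external command file using the PROCESS_SERVICE_CHECK_RESULT command. This is a rough sketch: the helper names are hypothetical, and the location of the command file depends on the `command_file` setting in your nagios.cfg.

```ruby
# Formats a passive service check result in the external command syntax:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<output>
def passive_service_result(host, service, return_code, output, time: Time.now)
  # Newlines would end the command early and ";" is the field separator,
  # so both are replaced with spaces to be on the safe side.
  clean = output.to_s.gsub(/[;\n]/, " ")
  "[#{time.to_i}] PROCESS_SERVICE_CHECK_RESULT;#{host};#{service};#{return_code};#{clean}"
end

# Appends the command to the command file (a named pipe on a real Nagios host).
# The path is an assumption and must match command_file in nagios.cfg.
def submit(command, path)
  File.open(path, "a") { |f| f.puts(command) }
end

puts passive_service_result("web01", "HTTP", 2, "CRITICAL - connection refused",
                            time: Time.at(1_300_000_000))
```

The host and service names have to match objects that Nagios already knows about and that accept passive checks; otherwise the result is discarded.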

Transport over a bus system

Instead of using the traditional provided interfaces, people are starting to send the check information over a bus for further handling:

Look ma, no Nagios Server needed

Some people have taken an alternative approach: they keep the existing check libraries but run them in their own framework.

Nagios - OUT

Reading Status

As there is no official API to extract status information from Nagios, people have been implementing various ways of getting to the data:

Scraping the UI

Well if we really have to …

Parsing status.dat file

All status information from Nagios is stored in the status.dat file, so several people have written parsers for it and exposed it as an API.

Nagios-Dashboard parses the nagios status.dat file & sends the current status to clients via an HTML5 WebSocket. The dashboard monitors the status.dat file for changes, any modifications trigger client updates (push). Nagios-Dashboard queries a Chef server or Opscode platform organization for additional host information.
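To give an idea of what such a parser involves: status.dat is a sequence of blocks of the form `type { key=value ... }`. Below is a minimal sketch of a parser for that layout. It is not the code of any of the projects above, and the sample data is invented.

```ruby
# Parses status.dat-style text into an array of hashes, one per block.
# Each hash gets a "type" key (e.g. "hoststatus") plus the key=value pairs.
def parse_status_dat(text)
  blocks = []
  current = nil
  text.each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?("#")
    if current.nil? && (m = line.match(/\A(\w+)\s*\{\z/))
      current = { "type" => m[1] }          # block start, e.g. "hoststatus {"
    elsif line == "}"
      blocks << current if current          # block end
      current = nil
    elsif current
      key, value = line.split("=", 2)       # values may themselves contain "="
      current[key] = value.to_s
    end
  end
  blocks
end

# Invented sample data in the status.dat layout.
SAMPLE_STATUS = <<~DAT
  hoststatus {
    host_name=web01
    current_state=0
    plugin_output=PING OK - Packet loss = 0%, RTA = 0.40 ms
    }

  servicestatus {
    host_name=web01
    service_description=HTTP
    current_state=2
    }
DAT

critical = parse_status_dat(SAMPLE_STATUS).select do |b|
  b["type"] == "servicestatus" && b["current_state"] == "2"
end
critical.each { |s| puts "#{s['host_name']}/#{s['service_description']} is CRITICAL" }
```

Keep in mind that Nagios rewrites the file periodically, so a reader like this is still polling, just against a file instead of the UI.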

Parsing the log files

Using CheckMK_Livestatus

A better option for ad hoc status queries is CheckMK_Livestatus (http://mathias-kettner.de/checkmk_livestatus.html). It is a Nagios Event Broker (NEB) module that hooks directly into the Nagios core, giving it direct access to all structures and commands. NEBs are very powerful; for more information, look at the event broker section of the Nagios book.
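Livestatus is queried with a small line-based language (LQL) over a Unix socket: a `GET <table>` line, optional headers such as `Columns:`, `Filter:` and `OutputFormat:`, and an empty line to end the request. The sketch below builds such a query and shows how it could be sent. The helper names are hypothetical and the socket path in the comment is an assumption that depends on how the broker module is configured.

```ruby
require "json"
require "socket"

# Builds an LQL request; the trailing blank line terminates the request.
def livestatus_query(table, columns:, filters: [])
  lines = ["GET #{table}", "Columns: #{columns.join(' ')}"]
  filters.each { |f| lines << "Filter: #{f}" }
  lines << "OutputFormat: json"
  lines.join("\n") + "\n\n"
end

# Sends the query over the Livestatus Unix socket and parses the JSON rows.
def livestatus_fetch(socket_path, query)
  UNIXSocket.open(socket_path) do |sock|
    sock.write(query)
    sock.close_write            # signal that the request is complete
    JSON.parse(sock.read)
  end
end

query = livestatus_query("services",
                         columns: %w[host_name description state],
                         filters: ["state = 2"])
puts query
# On a Nagios host (socket path is an assumption):
# rows = livestatus_fetch("/var/lib/nagios/rw/live", query)
```

Because the query is plain text, it is easy to try out by hand before wiring it into a tool.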

Tools that use this API:

Querying the database/NDO

An alternative NEB handler is NDO Utils (NDO2DB), which stores all the information in a database. Another option is NDO2FS, which writes the NDO data as JSON files on a filesystem.

Hooking into the performance data handler

RI Pienaar shows us how to hook into a process-service-perfdata handler and log that information to a file:

The advantage is that we get the information as events instead of having to poll for status. In other words, it is ready to be put on a message bus for others to read.
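To make that concrete, here is a sketch of what a perfdata handler could do with the data before publishing it. Performance data follows the plugin format `label=value[UOM];warn;crit;min;max`, with items separated by spaces. The parser and the event layout are my own illustration, not RI Pienaar's code, and the sketch does not handle quoted labels that contain spaces.

```ruby
require "json"

# One perfdata item: label=value[UOM][;warn;crit;min;max]
PERFDATA_ITEM = /\A(?<label>[^=]+)=(?<value>-?[\d.]+)(?<unit>[^;\d]*)(?:;(?<rest>.*))?\z/

# Parses a perfdata string into an array of metric hashes.
# Thresholds are kept as strings because they may be ranges such as "10:20".
def parse_perfdata(perfdata)
  perfdata.to_s.split.map do |item|
    m = PERFDATA_ITEM.match(item)
    next unless m
    fields = m[:rest].to_s.split(";", -1).map { |v| v.empty? ? nil : v }
    warning, critical, minimum, maximum = fields
    { label: m[:label].delete("'"), value: m[:value].to_f, unit: m[:unit],
      warning: warning, critical: critical, minimum: minimum, maximum: maximum }
  end.compact
end

# Hypothetical event layout, ready to be serialized and put on a bus.
event = {
  host: "web01",
  service: "PING",
  metrics: parse_perfdata("rta=0.40ms;100;500;0 pl=0%;20;60")
}
puts JSON.generate(event)
```

In a real handler, the host, service and perfdata strings would come from the macros Nagios passes to the perfdata command rather than from literals.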

Listening in to events with NEB/Message queue

In order to get the events as fast as possible, I looked into using a NEB to put information on a message queue directly.

I found the following sample code:

Marius Sturm wrote Nagios-ZMQ (https://github.com/mariussturm/nagios-zmq), which puts the events directly on the queue. I extended it to read not only the check results and performance data, but also the notifications.

It seems Icinga is taking a similar approach with Icinga - ZMQ - icingamq, in order to enable high-performance, large-scale monitoring.

An interesting difference is that it will also expose the CheckMK_Livestatus API directly over ZeroMQ.

Adding Hosts dynamically

A bit of a sidetrack, but one of the things a lot of people struggle with is dynamically adding hosts/servers to Nagios without restarting it. The following links try to solve this problem, but none solves it completely. It seems most people handle it through some interaction between a configuration management system and a system inventory.

To read and write the configs, people have written various parsers:

The reload problem doesn’t look like an easy one to solve: one could create a NEB that manipulates the in-memory host/service structures, but it would also need to persist them on disk. If anyone has a good solution, please let us know!
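The config-generation half of the problem is the easy part. As a sketch, this is how an inventory entry could be turned into a Nagios host definition. The inventory hash layout and the `generic-host` template name are assumptions, and Nagios still needs a reload before it picks up the resulting file.

```ruby
# Renders a Nagios "define host" block from an inventory entry.
# Expected keys (hypothetical layout): :name, :address, optional :template, :groups.
def host_definition(host)
  attrs = {
    "use"       => host.fetch(:template, "generic-host"),  # template name is an assumption
    "host_name" => host.fetch(:name),
    "address"   => host.fetch(:address)
  }
  attrs["hostgroups"] = host[:groups].join(",") if host[:groups]&.any?
  body = attrs.map { |key, value| format("    %-12s %s", key, value) }.join("\n")
  "define host {\n#{body}\n}\n"
end

inventory = [
  { name: "web01", address: "10.0.0.11", groups: %w[web production] },
  { name: "db01",  address: "10.0.0.21" }
]
puts inventory.map { |h| host_definition(h) }.join("\n")
```

Writing one file per host into a directory that nagios.cfg includes via `cfg_dir` keeps adding and removing hosts down to creating or deleting a file, followed by a reload.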

Notification handling

There are a lot more problems with Nagios, but people still use its notification and acknowledgement system. Some interesting things I found:

Pingdom

If Pingdom is your game, here are some API libraries to send information to Pingdom and read the status:

I could not find a way to make this evented, we’ll have to create

Jira Notification

I found 4 libraries to interact with Jira from Ruby:

Conclusion:

  • We can go a long way towards automating getting data in and out of Nagios
  • Exposing the API through Livestatus works really well
  • Using the NEB Nagios-ZMQ will allow us to get the information in an evented way
  • Adding hosts dynamically still seems to be an issue

By listening in on the events over a queue, we could create a self-service system for Nagios events similar to Tattle, which does the same for Graphite:

In the next blog post we’ll move up the stack a bit and start investigating options for application and end-user usage metrics.